Foreword


The mantra of “Big Data” — the two capital letters in this expression epitomise the massive 
nature of the data involved — has increasingly gained traction in recent years. When this 
mantra first appeared, it had the aura of academic disciplines, and almost every sphere 
of business began dipping into the sea of big data. During that trail-blazing period, most 
scholars in the social and human sciences, owing largely to their academic training, felt 
immobilised and technologically hapless at the prospect of big data. This point is poignantly 
contextualised in the current book at the beginning of Chapter 2: “The very notion of big data 
creeping into their research spaces casts an intimidating shadow over traditional humanists 
and social scientists, who may fear human behaviour being reduced to mere mathematical 
models” (p. 19). Much information about the latter would emerge from the Internet and 
from big commercial publishing houses as is still the case even now. Again, during that 
period, there was not the slightest chance that any scholar from these two cognate sciences, 
least of all scholars from the South African context, would ever dabble in the nascent field 
of big data. But not anymore. This is the background against which the current book, aptly 
titled, “Reinventing the Social Scientist and Humanist in the Era of Big Data: A Perspective 
from South African Scholars”, should be viewed. Not only is this book’s title apposite, but 
the putting together of the book itself is a fitting and welcome scholarly event in the South 
African higher education ecosystem, and more so given the related mantra of the Fourth 
Industrial Revolution (4IR) — of which big data is an integral part (see Chaka 2019; Chaka 
forthcoming) — which is gaining currency in the South African higher education system. 

Given the points highlighted above, a pertinent question to pose is: what role does 
big data have to play in the social and human sciences? Before delving into and providing a 
bespoke answer that the book attempts to offer, I venture to say that big data has a role to 
play in every aspect of the social and human sciences. In the main, the book is “about big data 
aimed specifically at academics in the humanities and social sciences” (p. 1) and is written 
by three scholars from South Africa, and dare I add, by three scholars from the global South. 
Overall, the book boasts ten chapters whose lowdown has been eloquently captured in the 
Introduction. One of the telling points of the book is its brutal honesty about the fact that 
despite concerted efforts by South African scholars to create a nexus between big data and 
the digital humanities (DH), the applications of big data in the “traditional humanities and 
social sciences” (p. 3) are, at best, few and far between, and at worst, “an alien phenomenon 
at local universities” (p. 3). In fact, argues the book, at these universities, big data analytics 
is almost the exclusive preserve of computer science disciplines. 


Launching its first chapter by delineating the fuzziness that characterises the 
genealogy of big data and by mapping a historical trajectory of “(big) data storage” (p. 
7), the book meticulously provides its well-articulated successive themes and sub-themes 
in an integrated whole. A few examples are “Two dominant metaphors” (p. 21); “Digital
humanists and computational social scientists” (p. 24); “Big Data, big despair: Myths 
debunked and lessons learned” (p. 29); “The nitty-gritty: Big data infrastructure” (p. 98); 
“The Hadoop ecosystem” (p. 110); and “Marrying (big) data science and the humanities 
and social sciences” (p. 124). It then rounds off its discussion by presenting a real-world 
big data-based chapter to demonstrate how some of the aspects of big data may be applied 
to tweets (about the word Afrikaner) harvested from Twitter as one example of a big data 
generating platform. The book could not have come at a better moment for advancing 
big data scholarship for human and social sciences and for experimenting with datafication 
(Chaka 2019) as it pertains to these two cognate sciences. 


Chaka Chaka 
December 2019 


Introduction 


Will large quantities of data transform how we study human communication and
culture, or narrow the palette of research options and alter what ‘research’ means?


— Danah boyd and Kate Crawford (2012:663) 


One of the authors of this book encountered the concept of a ‘data scientist unicorn’ for 
the first time in a position article on (big) data analytics in 2015.¹ She was intrigued by the
romantic notion that a data scientist was expected to be a wunderkind in the sense of being 
exceptionally well versed in all the myriad areas that encompass data science. Her fascination 
with the idea of a data scientist as being some kind of mythical creature stemmed partly 
from the fact that she is a linguist housed in an English department in a humanities faculty; 
as a linguist she is interested, amongst other things, in the (covert) effects of metaphorical 
framing of ideas, practices, and disciplines that are regarded as emergent. Since data scientists 
are increasingly working in the arena of big data, she also began to explore how big data 
itself is conceptualised by the mass media and academics, and what she found was that 
this phenomenon is variously described as “portentous” (Lupton 2015:2) and “perverse”
(Lupton 2015:2) or as “gold” (van Dijck 2014:199) and “oil” (Dean 2014:9). As insightful as 
these descriptions were, they did not tell her anything about the ontological attributes of big 
data, and so she turned to two colleagues employed at the same institution for answers. One 
of these colleagues is also a humanist, but with a great deal of experience in using big data 
tools to conduct research in the fields of complex networks and information technology; the 
other is a data scientist working towards developing technologies for extraction and analysis 
of sentiment from microblogging sites by employing a combination of natural language 
processing, text mining, and machine learning techniques. As these three academics began 
to discover some of the advantages of big data research, they also began to wonder if and how 
big data could be harnessed by traditional humanists and social scientists in South Africa, 
particularly in light of the fact that many of these scholars remain sceptical about its ultimate 
import. It is through countless debates and emails that they decided to write a book about 
big data aimed specifically at academics in the humanities and social sciences. 

Big data has changed virtually every aspect of society over the last two decades. In 
business, big data has facilitated better marketing campaigns and monitoring of sales and 
manufacturing processes (Davenport 2014). Most large international corporations such as 


1 “Chasing the data science unicorn” by David Stodder (https://tdwi.org/articles/2015/01/06/chasing-the-data-science-unicorn.aspx).
Amazon, Google, Facebook, Coca-Cola and Walmart use big data in some way or another,
as do major banks and financial institutions. 

Governments have also seized opportunities created by big data. In March 2012, 
the United States (US) government launched their Big Data Research and Development 
Initiative, involving various agencies in infrastructure development to store, manage, and 
analyse large-scale data (Lazar 2012:47; Chen, Mao & Liu 2014:175). Big data also played 
an important role in Barack Obama’s election campaign to rally individual voters (Jin, Wah, 
Cheng & Wang 2015:61). Jin et al. (2015:60) anticipate “that future economic and political
competitions among countries will be based on exploiting the potential of big data, among 
other traditional aspects. In short, the research and applications of big data are of strategic 
importance and significance for improving the competitiveness of any country”. The 
Defense Advanced Research Projects Agency (DARPA) — responsible for the creation of the 
Internet — is one of the organisations once again involved in this technological development 
through one of their programmes called ADAMS (Anomaly detection at multiple scales). 
The US Department of Defense has a programme called MAPD (Mathematics for the 
analysis of petascale data) (Lazar 2012:48), which has been developed to extract insights 
from huge scientific datasets. The National Security Agency (NSA) in the US exploits big 
data through its Planning Tool for Resource Integration, Synchronization and Management 
(PRISM) programme, while the United Kingdom (UK) does the same through Tempora 
(Lyon 2014:2). Even the United Nations — which Davenport (2014:17) notes is not known 
for technological innovation — has a big data programme called HunchWorks. 

In science, big data has resulted in what some call the fourth paradigm² (Park &
Leydesdorff 2013:757; Abreu & Acker 2013:549; Hitzler & Janowicz 2013:233). Big data 
has become so important that journals have emerged over the past two decades that focus 
specifically on this field. These include Annals of Data Science, International Journal of Data 
Science and Analytics, EPJ Data Science, GigaScience, and Big Data & Society.

From the outset, it should be noted that this book neither hails big data as the 
game changer of the twenty-first century nor dismisses the phenomenon outright. Instead, it 
interrogates the multiple facets of and (controversial) issues surrounding big data with specific 
reference to the humanities and social sciences. Amongst other things, the book challenges 
the notion that big data is a monolith; asks critical questions about its epistemology; and 
delves into important aspects of the phenomenon related to its so-called objectivity and 
accuracy, emerging ethical concerns surrounding its design and implementation, and very 
real anxieties about the new digital divide it is creating. In a sense the book also seeks to allay 
the fears that traditional humanists and social scientists might have about big data signalling 


2 The first paradigm was empirical science, the second theoretical science, and the third computer-driven
science (Chen & Zhang 2014:315).
the end of theory and resulting in a data-driven world in which individuals simply keep
generating new data before developing new algorithms to process it. 

We would like to clarify why the book’s sub-title includes the words ‘a perspective 
from South African scholars’. Although we refer to the situation in South Africa as it 
pertains to the advent of the digital age and to links between big data and the humanities/ 
social sciences, the book does not reflect an explicit South African contextualisation across 
all chapters. Selecting a sub-title such as ‘a South African perspective’ would have been 
misleading on our part. The wording in the sub-title reflects the fact that we are South 
African scholars writing about big data. In addition, and where possible, we have discussed 
technology and big data as they relate to the South African context, and the final chapter 
showcases a big data project that focuses specifically on a South African phenomenon. 
The main reason why each chapter does not constitute a more deliberate South African 
contextualisation is that, aside from efforts by South African scholars to link big data and 
the digital humanities³, the application of big data in the traditional humanities and social
sciences is, for the most part, an alien phenomenon at local universities. Indeed, at these 
universities, big data analytics is typically taught in computer science/information systems 
departments or faculties, for example.⁴

Chapter 1 provides a brief history of big data by examining the foundational narrative 
behind it. This chapter puts paid to the notion that big data does not have a lengthy past 
and offers the caveat that all scholars — no matter what their discipline — should not ignore 
big data’s history because “[even] in a petabyte world, history matters” (Barnes 2013:298).

In Chapter 2, the focus shifts to positioning big data in the (digital) humanities 
and (computational) social sciences. It considers how, if at all, big data has affected the 
epistemologies of these disciplines. This chapter also evaluates the now widely adopted 
practice of collaboration between big data scientists and humanists/social scientists, since 
each type of researcher possesses knowledge and skills that the other does not have. 

Chapter 3 contemplates what some scholars have referred to as big data hubris. 
To this end, it begins with a description of one of the most significant big data ‘fails’ of 
the millennium before attempting to identify and resolve some of the numerous myths 
that surround big data, myths that have, at least to some degree, reduced the appetite that 
scholars in the humanities and social sciences may have for pursuing big data studies in 


3 Voices from the South (2018), edited by Amanda du Preez and published by AOSIS, is an 
excellent volume that illustrates how humanities and arts scholars in South Africa are utilising 
data-driven research. 


4 Examples are the University of Pretoria, the University of the Western Cape, and the University of 
the Witwatersrand. Interestingly, the latter university does offer an MA degree through its School of 
Social Sciences in the Humanities Faculty which focuses on data-driven methods in the humanities 
and social sciences. 
their fields. Each common misconception is linked to a particular lesson that humanists and
social scientists may harness to avoid some of the errors that have been committed by big 
data scholars in the past. 

Ethical considerations constitute the focus of Chapter 4, and are reviewed against 
the background of some disturbing big data fiascos that only serve to signal the ongoing 
tendency within the big data ecosystem to neglect or entirely ignore ethical research. 

Big data visualisation decisions have to be made in terms of ethical guidelines, and 
a detailed discussion of visualisation is contained in Chapter 5. A chapter has been devoted 
to this element given the complex cognitive, social, and emotional risks associated with 
graphical representations of data. The chapter has been written mainly with humanists in 
mind, since some scholars are of the view that data visualisation and the humanities cannot 
be reconciled. Social scientists and scholars in other disciplines should nevertheless find the 
information contained in the chapter quite useful. 

In Chapter 6, the notion of big data power is probed against the background of 
studies conducted by humanists and social scientists. Although the chapter touches on big 
data’s dark side, it also reflects on the positive contributions made to big data research by 
scholars in fields such as psychology, literature, history, political science, and the spatial 
humanities. While the chapter takes into account research that revolves around large volumes 
of data, it also challenges what big data scientists regard as ‘big’ since data size is a relative 
concept in the humanities and social sciences. 

Since traditional humanists and social scientists tend to be qualitative researchers, 
Chapter 7 critically appraises the various qualitative data analysis software (QDAS) tools 
they may utilise to conduct big data research. The chapter dispels a frequently held myth 
that the use of such tools eliminates the need for human interpretation of data. 

Chapter 8 focuses on the various components that make up what big data scientists 
refer to as big data architecture. It is necessary to take a closer look at the more technical 
aspects of the big data ecosystem since QDAS, while able to manage fairly large datasets, 
cannot adequately accommodate big data as defined by big data scientists. 

In the world of big data, data science has become an important buzzword, but needs 
to be interrogated in light of the fact that it is a nascent rather than established discipline 
that is defined in ways that tend to perplex academics from different fields. Chapter 9 is thus 
devoted to deconstructing what data science and data scientists embody. The chapter also 
pays some attention to best practices for academic data science with a view to establishing 
a socially informed, “human-centred” (Neff, Tanweer, Fiore-Gartland & Osburn 2017:95) 
data science that also involves ethical thinking. 

Chapter 10 focuses on a specific big data study in the humanities in order to 
illustrate how big data may be employed to answer specific research questions. The study 
revolves around considering how tweets — both in and outside South Africa — have depicted
Afrikaners in light of recent social and political events such as racist incidents and the 
government’s decision to expropriate land without compensation. 

Currently, big data is a fraught phenomenon that either excites academics or 
generates feelings of anxiety and confusion. It is entirely up to scholars whether or not they 
would like to carry out big data research, and the various issues in this book may help them 
critically interrogate the phenomenon and understand its complex and evolving nuances. 


Chapter 1 


The (fuzzy) origins of big data and the 
dangers of ignoring history 


... the past remains potent for big data and ... proponents ignore it at their peril. 
— Trevor Barnes (2013:297) 


Does big data have a past or is it a phenomenon that simply exploded onto the global 
electronic scene with the advent of the microprocessor, the Internet, and the World Wide 
Web? If big data does indeed have a foundational narrative, what can we glean from this 
narrative, if anything? Does big data’s emergence constitute a revolution or an evolution? If it
is “a disruptive force” (Bollier 2010:28) to be reckoned with, then how should traditional 
humanists and social scientists view its place in their research environments? We attempt to 
answer these questions in this first chapter (and of course in subsequent ones) by navigating 
through the somewhat cloudy waters of big data’s past. 


1.1 A messy affair 


“Big data” is a relative term depending on who is discussing it. 
— Keith Foote (2017:1) 


Any researcher who attempts to trace the history of big data will find this exercise both 
challenging and frustrating, as it is a phenomenon that simply does not fit into neat boxes. 
In the absence of academic papers that attempt to accurately chart the origins of big data, a 
survey of (visual) timelines of this phenomenon on the Internet has signalled that individuals 
in different fields hold opposing views on who originated it. Some recognise Roger
Magoulas (director of market research at O’Reilly Media) as having coined the term big
data in 2005, while others argue that credit belongs to John Mashey (1998), who was chief 
computer scientist at Silicon Graphics, Incorporated (SGI) in the 1990s. Still others have 
pointed out that academic references to the term appeared for the first time in studies by 
Cox and Ellsworth (1997), Weiss and Indurkhya (1998), and Diebold (2003). To muddy 
the waters even further, the origins of big data have variously been described as “intriguing 
and a bit murky” (Diebold 2012:2), “brief” (Marr 2015a:1), “uncertain” (Gandomi & 
Haider 2015:138), and even “disputed” (Kaplan 2015:2). Since the history of big data is
somewhat messy, with researchers and technology journalists vehemently contesting the role 
specific individuals have played in its evolution, it is perhaps easier to think about it in terms 
of (big) data storage before and after the advent of the digital age; the emergence of statistical 
analysis and business intelligence (BI); the digital revolution; as well as significant events 
surrounding big data, with some acknowledgment going to Gil Press (2013) and Bernard 
Marr (2015a) who have identified key milestones in these areas. 


1.2 The history of (big) data storage 


Although it might be easy to forget, our increasing ability to store and analyze 
information has been a gradual evolution. — Bernard Marr (2015a:1) 


Attempting to uncover the origins of big data would not be complete without first 
considering how people through the ages have stored information they consider to be 
valuable. What we know is that the history of data storage goes back to approximately 
18000 BCE when Palaeolithic tribespeople in East Africa made notches on sticks or bones — 
presumably to facilitate counting when it came to their food supplies and trading activities 
(Igarashi, Altman, Funada & Kamiyama 2014:3). A fascinating example of an ancient 
storage device is the Caral quipu, a system of knotted strings that archaeologists believe 
may have been used 5000 years ago by the Incas to store massive amounts of data about 
their culture and civilisation (Chrisomalis 2009:67). Between 2500 and 2400 BCE, dwellers 
in the ancient city of Ebla (an early kingdom in Syria) used clay tablets to store huge
volumes of information pertaining to their economic, diplomatic, and commercial activities 
(Kaplan & di Lenardo 2017:4). Other ancient storage tools include coins (thought to have 
been used for the first time in the sixth or fifth century BCE) to store economic data and 
the Antikythera mechanism, which scholars speculate might be an analogue astronomical 
computer the Greeks used in circa 82 BCE to track planetary movements. Big data storage 
was first attempted by the Babylonians (in circa 2400 BCE) and the Alexandrians (between 
200 BCE-48 CE) when they constructed libraries to store large collections of tablets and 
scrolls, respectively. From 130-115 BCE, Rome’s pontifex maximus, Publius Mucius Scaevola, 
had his archives stored in the Annales maximi in order to record details about events such as 
battles and famines (Forsythe 2012). Unlike clay tablets, which could be erased with ease,
the Annales were more durable in terms of wear and tear, showing that from a technological 
perspective, storage devices improved over time (Kaplan & di Lenardo 2017:4). 

What all these ancient devices have in common “is their capacity to deal with an 
open-ended stream of information and reorganize it to fit a given information paradigm” 
(Kaplan & di Lenardo 2017:3) — they are “regulated representations ... governed by a set
of production and usage rules” (Kaplan & di Lenardo 2017:3-4). A coin, for example, is 
a regulated representation reflecting information about a particular civilisation’s economy,
trade, and social organisation, amongst other things. 

When it comes to the Renaissance period in Europe, what Kaplan and di Lenardo 
(2017:3) call data acceleration became pronounced: “[not] only were new editions of ancient 
texts starting to be printed and circulated but also a deluge of ‘how-to books’ explaining 
previously secret arts and methods ... This sudden increase in knowledge and exposure to new 
practices created a well-documented feeling of information overload” (Kaplan & di Lenardo 
2017:3). In terms of technology, the Renaissance period is noted for improvements in search 
and retrieval; in this respect, scholars point to the invention of indexes, chronologies, and 
accounting tables (Kaplan & di Lenardo 2017:3). A major step in the development of data 
storage was the evolution of the printing industry which contributed significantly to open- 
ended streams of data which then became part of the data deluge (Kaplan & di Lenardo 
2017:2). The Gutenberg Press, invented around 1440 by Johannes Gutenberg of Germany,
is often hailed as the first movable type printing press of its kind. However, Chinese artisan
Bi Sheng is credited with having invented a printing press out of Chinese porcelain between 
1041 and 1048 (Gunaratne 2001:467). 

The eighteenth and nineteenth centuries are marked by “shared knowledge systems” 
as well as by “the rise of standardization [which] reinforced the idea that the fate of every 
dataset is to become, sooner or later, a shared resource by which new predictions and patterns 
can be established” (Kaplan & di Lenardo 2017:11). In the eighteenth century, rapid 
developments occurred in the field of lexicography which saw the publication of dictionaries 
such as Encyclopédie edited by Denis Diderot (1751) and Samuel Johnson's A Dictionary of the 
English Language published on 4 April 1755. These dictionaries “[illustrated] a growing need 
to not just collect but classify, categorise and order information to make it both meaningful 
and useful” (Robertson & Travaglia 2015:2), although it must be added that eighteenth- 
century dictionaries were not particularly “enlightened” (McIntosh 1998:3), since they 
were “more likely to represent the subjective impressions and prejudices of the editor or 
his sources than the objective documentation of language which became a feature of work 
from the mid- to late nineteenth century” (Simpson 1989:181-182). The first significant
industrial application of data storage in the eighteenth century may be attributed to Basile 
Bouchon, a textile worker in Lyon, who invented a perforated paper tape mechanism for 
storing patterns to be used on cloth in 1725. By the early part of the nineteenth century, 
which is also referred to as the pre-digital era (Robertson & Travaglia 2015:2), Joseph Marie 
Jacquard had refined Bouchon’s device with a view to further simplifying the process of 
manufacturing textiles (1804). In terms of the automation of computation, historians contend
that huge strides were made in 1822 when Charles Babbage proposed the Difference Engine
to tabulate polynomial functions and again in 1837 when Babbage described a mechanical
computer called the Analytical Engine. What makes this computer particularly significant is 
that its co-designer was Augusta Ada Byron King, Countess of Lovelace (1815-1852), who 
is regarded as the first computer programmer (Stanley 2016). 

One of the most important contributions made to the evolution of digital computers 
and artificial intelligence was that of George Boole, who was an English philosopher, 
mathematician, and logician. In Laws of thought (1854), Boole began developing an algebra 
for logical inference, and today we still rely on Boolean algebra, which uses binary
values (that is, 0 and 1), to operate computers.

When it comes to the twentieth century, methods for storing large amounts of data 
included Fritz Pfleumer’s magnetic tape invented in 1928 (Levaux 2017), the Selectron tube 
(1946) which could hold between 32 and 512 bytes of data, the hard disk drive developed by 
IBM in 1956, the cassette tape produced by the Philips Company in 1962, and the floppy
disk drive conceived in 1967 by Alan Shugart who worked for IBM. By the 1990s, floppy 
disks could store approximately 250 megabytes of data. In the 1980s and 1990s, compact 
discs (CDs) and digital versatile discs (DVDs) dominated the global market until universal
serial bus (USB) flash drives and secure digital (SD) cards made their way onto the scene 
in 2000. 

Currently we are reaping the benefits of network-based or cloud computing for 
processing, storing, and distributing big data. The origins of this technology can be traced 
back to the 1960s when Joseph Carl Robnett Licklider introduced the concept of an 
“intergalactic computer network” (Kaufman 2009:61) at the Advanced Research Projects
Agency (ARPA).⁵ However, others dispute this, arguing that cloud computing made its
debut in 2006 when former Google CEO Eric Schmidt referred to it at a search engine 
conference (Regalado 2011:4). The current prediction is that the shape of cloud computing 
will change radically in the near future with some “[envisioning] a hybrid model where 
cloud users begin deploying small-scale data [centres] in strategic geographic locations 
that move data processing capabilities closer to the user — but are still centrally managed” 
(Froehlich 2017:1). 


5 In an office memorandum sent to his colleagues in 1963, Licklider referred to this experimental 
computer network as an open networking system that would be “the main and essential medium of 
informational interaction for governments, institutions, corporations and individuals” (cf. Garreau 
2006:22). Needless to say, this version of Licklider’s vision called ARPANET was the forerunner of
the Internet. 


1.3 The emergence of statistical analysis


Errors using inadequate data are much less than those using no data at all. 
— Charles Babbage (circa 1850) 


In his essay on the history of big data, sociologist David Beer (2016a:1) is of the view 
that “we already have a history of big data that can be found in accounts of the use of 
statistics to know and govern populations”, and it is for this reason that we conduct a brief 
survey of the emergence of statistics through the centuries. Standard statistics was not a 
reality in prehistoric times, but scholars contend that its beginnings can nevertheless be 
found in ancient Sumer, Crete, China, and Egypt, for example, in the form of censuses
related to livestock, capitation tax, the construction of buildings, and the like (cf. Kitchin 
& Lauriault 2014:2). Essentially, statistics was restricted to collation of the population or to 
the recording of trade activities in most early empires. It appears that the earliest reference 
to statistics appeared in Manuscript on deciphering cryptographic messages written by Sheikh 
Al-Kindi in the ninth century (Singh 2000). 

Before 1600, the origins of statistics are not entirely clear (Creighton 2012), but 
Liber de ludo aleae (Book on games of chance) written by Gerolamo Cardano (1501-1576) 
and published in the sixteenth century was hailed as a sophisticated approach to probability 
calculus (Bellhouse 2005:180). Then, in 1577, a Dominican theologian, Bartolomé de 
Medina (1527-1581), defended the doctrine of probabilism which states that if certainty 
about a particular position or issue is not possible, then the best criterion to follow is 
probability (cf. Decock 2013:75). 

In the seventeenth century, John Graunt (1620-1674) made scientific inferences 
based on the London bills of mortality⁶ (cf. Kotz 2005:140) and is particularly famous
for providing statistical details about outbreaks of bubonic plague in London (Sartorius, 
Jacobsen, Törner & Giesecke 2006:181) in his book entitled Natural and political observations,
mentioned in a following index, and made upon the bills of mortality.⁷ French mathematicians
Blaise Pascal and Pierre de Fermat proposed probability theory, which originated out of 


6 In 1603, James I instructed the Company of Parish Clerks to publish weekly accounts of London’s 
births and deaths and this culminated in the bills of mortality (Morabia 2013:1). Interestingly, 
during the London plague of 1665, the Company of Parish Clerks employed mainly elderly women 
called ‘searchers’ to count the number of people who died and to determine their causes of death 
(Slauter 2011:9). At this time, opponents of the bills of mortality heavily criticised searchers for their 
lack of medical training, maintaining that some of them could not be trusted to keep accurate records 
because they often misreported deaths (Slauter 2011:9). 

7 John Graunt’s full text was edited by Walter Francis Willcox in 1939 and published by Johns
Hopkins Press. The text is regarded as constituting pioneering work on the science of demography
(Brimblecombe 2017:18).
a dispute over a popular game of dice in 1654. This mathematical theory is extensively 
employed for big data analysis today (Singpurwalla & Landon 2014:19). Statistical analysis 
advanced further thanks to Sir Henry Furnese, who employed business intelligence (or BI) 
to collect and analyse data about the markets in order to make a profit before his business 
rivals could do the same. His data analysis is described in Richard Millar Devens’ Cyclopaedia 
of commercial and business anecdotes (1865). In an interesting article on big data in historical 
perspective, and basing her thoughts on Hacking’s (1991) essay on the history of statistics, 
Meg Ambrose (2015:203) points to “the avalanche of printed numbers that flooded Europe 
between 1820 and 1840”, a period known as the probabilistic revolution in terms of 
statistical analysis. Ambrose (2015:203) observes important similarities between this period 
and what is occurring today: 


Between 1820 and 1840, a flood of data from across society became 
available, aggregated, analyzed, and acted upon. From this period, a series 
of similarities to big data can be extracted: datafication issues, big data 
lures, and structural changes. A number of social issues surfaced in the 
1800s that have resurfaced today: governability, classification effects, and 
data-based knowledge. Enthusiasm for big data during both periods was 
driven by particular lures: standardized sharing, objectivity, control through 
feedback, enumeration, and the discovery versus production of knowledge. 
Both periods also experience(d) structural changes: division of data labor, 
methodological changes, and a displacement of theory. 


Data visualisation techniques also began to emerge, such as John Snow’s maps, 
employed to track outbreaks of cholera in London in 1854, and Florence 
Nightingale’s (1858) sunburst graph, designed to record, amongst other things, the mortality 
rates of British soldiers fighting on the Crimean Peninsula between 1854 and 1855. In 1880, 
statistical computation took a giant leap forward when data processing pioneer Herman 
Hollerith, an engineer employed by the US Census Bureau, responded to data acceleration 
in the form of “an initial societal stress” (Kaplan & di Lenardo 2017:3) which, in the case of 
North America, involved a growing population that had to be counted. Conducting a census 
of the population in 1880 took a total of eight years to complete, but in the meantime, 
Hollerith invented a punch card that helped to significantly reduce tabulation time for 
the 1890 US census to two years (Austrian 1982). It is for this reason that some scholars 
call Hollerith the father of our modern automatic computation,⁸ although other scholars 
argue that this epithet should be attributed to either Charles Babbage (Cooper 2004:4) 


8 The company that Hollerith founded later became known as IBM (International Business Machines 
Corporation). 
or to Alan Turing (Daylight 2015:205), a pioneer of theoretical computer science and 
artificial intelligence. 

Twentieth-century statistics is equated with Karl Pearson, who famously presented 
the notion of the chi-square distribution, which constituted “a qualitative leap in applying 
powerful new mathematics (matrix theory) to statistical reasoning” (Efron 2003:32). Other 
influential scholars include chemist and statistician William Gosset, who used his work on 
the t-distribution (1908) to employ methods of sampling in Guinness’ Brewery in Ireland; 
Ronald Fisher, who introduced fundamental concepts such as sufficiency, efficiency, and 
optimality in Statistical methods for research workers (1925); Neyman and Pearson, who 
published a seminal paper on optimality theory for testing problems in 1932; and Gene 
Glass, who coined the term ‘meta-analysis’ in 1976. 


1.3.1 Big business, big data 


In order to understand the role of big data in big companies, it’s important to 
understand the historical context for analytics and the brief history of big data. 


— Thomas Davenport (2014:194). 


We noted in the section preceding this one that the term ‘business intelligence’ (BI) 
appeared in Devens’ (1865) Cyclopaedia of commercial and business anecdotes which among 
other things described how Henry Furnese utilised BI to outperform his competitors when 
playing the stock markets. What is significant about the study carried out by Furnese is 
that it is regarded by some as the first attempt to exploit data about business for 
commercial purposes (cf. Marr 2015a). In 1958, Hans Peter Luhn developed the notion of 
BI even further when he wrote a paper in IBM Journal of Research and Development outlining 
the development of an intelligence system that would “utilize data-processing machines 
for auto-abstracting and auto-encoding of documents and for creating interest profiles for 
each of the ‘action points’ in an organization” (Luhn 1958:314). It was only in the 1990s, 
however, that BI was widely exploited in the context of business (cf. Wang 2016:673) and 
today it is being reshaped by big data (cf. Fan, Lau & Zhao 2015:28). 

As a term, ‘big data’ appears to be fairly new in the world of business (Hashem, 
Yaqoob, Mokhtar, Gani & Khan 2015:100), but from the 1960s onwards, businesses across 
the globe began to design centralised computing systems to cope with the data deluge and 
to carry out analytics, which may at a basic level be defined as “the systematic computational 
analysis of data or statistics” (Oxford English Dictionary 2019). Thomas Davenport and Jill 
Dyché (2013:26) point out that it is useful to think about big business and big data in 
terms of what they refer to as Analytics 1.0, 2.0, and 3.0. Between 1954 and 2009, business 
organisations employed traditional Analytics 1.0, which Davenport and Dyché (2013) 
observe was characterised by internally sourced data that was small and structured and that 
was analysed mainly through descriptive analytics (reporting). The picture changed quite 
dramatically between 2005 and 2012 when huge Internet-based companies such as Google 
and eBay began to exploit the digitisation of huge amounts of data (cf. Stubbs 2010:3). 
Davenport and Dyché (2013) refer to this period as the advent of Analytics 2.0, which was 
characterised by a rather narrow analytical focus on data that allowed companies to carry 
out customer reviews and analyse their product data in real time. What made the Analytics 
2.0 era distinct from Analytics 1.0 was that data was both internally and externally sourced, 
while data was very large and/or unstructured. In addition, companies began to use big 
data processing frameworks such as Hadoop, Spark, and Storm (see Chapter 8) to manage 
and analyse enormous amounts of data with a view to achieving a competitive advantage 
over their rivals. We now appear to be entering Analytics 3.0, which Davenport and Dyché 
(2013) claim combines big data and traditional analytics, but this time with a greater 
emphasis on predictive or prescriptive analytics. Predictive analytics “is about forecasting 
and providing an estimation for the probability of a future result, defining opportunities or 
risks in the future” (Vassakis, Petrakis & Kopanakis 2018:10). Thus, for example, a company 
may choose to apply this kind of analytics in order to predict future trends and patterns as 
they relate to customer behaviour. Prescriptive analytics, by contrast, is aimed at forecasting 
the impact of future actions in order to improve strategic decision-making and generate 
solutions for problems related to costs or the development of new products, for example 
(Vassakis et al. 2018:10). 


1.4 The digital revolution and events surrounding big data 


From the dawn of civilization to 2003, five exabytes of data were created. The same 
amount was created in the last two days. — Google CEO Eric Schmidt (2010:1) 


Whether we are exploring the history of data storage or unearthing the origins and 
development of statistical analysis, what is clear is that it has always been necessary to find 
effective and efficient ways of collecting, processing, and managing data: “[appeals] to the 
‘data deluge’ and ‘information overload’ are not as new as we might be led to believe” (Levin 
2018:668), given that “every age was an age of information, each in its own way” (Darnton 
2000:1). As early as 1941, the huge volumes of data to be managed were described in terms 
of an “information explosion” in the Lawton Constitution, and this explosion has only 
been exacerbated by the digital revolution which could be described in terms of three acts, 
namely, “by the microprocessor and the power to compute, ... by the network and the power 
to connect, [and by] the third [which] will be defined by data and the power to predict” 
(Richards & King 2014:397). To cope with the tsunami of data, there are just over half a 
million data centres worldwide.⁹ Currently, there are 54 data centres in Africa, and 21 of 
them are located in South Africa.¹⁰ 

South Africa’s own digital evolution can be traced back to 1988 when Francois 
Jacot-Guillarmod, Dave Wilson, and Mike Lawrie set up an email link to the Internet at Rhodes 
University to improve access to the university and to academia.¹¹ These pioneers used a 
dial-up system which was then replaced with Internet access in 1991. Thereafter, commercial 
Internet Service Providers (ISPs) arrived in quick succession, which “propelled South 
Africa to one of the 20 most Internet-connected countries in the world by the mid-1990s” 
(Horowitz & Currie 2007:451). With the arrival of the Internet, a wave of developments 
took place, including the formation of the country’s first commercial ISP, the Internetworking 
Company of Southern Africa, TICSA (1993), which expanded Internet access even further. 
By 1997, dial-up connections of 56 kilobits per second (kbps)¹² had gained traction, 
followed shortly afterwards by Telkom’s 64 kbps Internet service (Friedenthal 2015:2). 
Wireless broadband arrived in 2004 and currently, South Africans have access to broadband 
speeds of over 100 megabits¹³ per second (Friedenthal 2015:2), while universities enjoy access through 
the cyber infrastructure for big data provided by the South African National Research 
Network (Friedenthal 2015:2). 

Collating information from Press (2013) and Marr (2015a), the timeline in Figure 
1.1 below visually represents some of the most significant events in the evolution of big 
data identified by Richards and King (2014:397) since the emergence of microprocessors, 
although these events certainly intersect, given that inventions and advances in computing 
overlap. We have included quotations from big data proponents to provide examples of how 
big data has been framed in the mass media, since framing may assist us in understanding 
the values and assumptions encoded in big data (cf. Portmess & Tower 2015:3) which we 
interrogate further in Chapter 2. 


9 Emerson Network Power. 

10 http://www.datacentermap.com/africa/. 

11 “The history of Internet access in South Africa” (https://mybroadband.co.za/news/internet/114645-the-history-of-internet-access-in-south-africa.html). 
We relied heavily on this particular site to glean 
information about the history of the data revolution in South Africa, which is sketchy to say the 
least. The site draws on a report entitled ‘Internet access in South Africa 2010’, written by Arthur 
Goldstuck, managing director of World Wide Worx. 

12 Kilobits per second is a measure of bandwidth (data transfer speed); one kbps is equivalent to 1000 
bits of bandwidth per second. 


13 One megabit equals one million bits. 
1970 - 1980s 

1971: Intel’s first microprocessor (4004) is conceived by Ted Hoff and Stanley Mazor (cf. 
Faggin, Hoff, Mazor & Shima 1996:10). 

1973: Robert Metcalfe, a researcher for Xerox, writes a memorandum in which he describes 
the Ethernet which is the most popular local area networking (LAN) technology today. 

1983: The Advanced Research Projects Agency Network adopts the Transmission Control 
Protocol and Internet Protocol, beginning to assemble a network that eventually becomes 
known as the Internet (cf. Leiner, Cole, Postel & Mills 1985:29). 

1989: Computer scientist Tim Berners-Lee invents the World Wide Web (cf. McPherson 2009:5). 

1989: Howard Dresner expands on the notion of business intelligence, which is re-defined as 
reflecting “concepts and methods to improve business decision making by using fact-based 
support systems” (Powers 2008:232). 


1990s 

1992: Crystal Reports, a business intelligence application, creates the first database report 
using Windows. 

1995: The Cray T3E™ supercomputer becomes the world’s best-selling MPP (massively parallel 
processing) system (cf. Chorley 2016:3). 

1999: The term big data appears for the first time in ‘Visually exploring gigabyte datasets in 
real time’ by Bryson, Kenwright, Cox, Ellsworth and Haimes (1999). 


First decade of 2000s 

2000: Peter Lyman and Hal Varian calculate that the world’s digital content would require 
1.5 billion gigabytes of storage. 

2001: Doug Laney publishes ‘3D data management: Controlling data volume, velocity and 
variety’, the three Vs of big data. 

2005: Online commenters observe huge increases in data volume owing to Web 2.0; Apache 
Hadoop is created to store and analyse these volumes. 

2008: As a concept, big data appears in Wired in a provocative article by Chris Anderson who 
declares the end of the scientific method. 

2010: Damien Black refers to big data as a “tsunami”, while Google announces that we are 
creating as much data every two days as was generated from the beginning of civilisation 
to 2003. 


Figure 1.1: Milestones in the era of big data since the introduction of the first commercial microprocessor 
1.5 Big data’s history lessons 


1.5.1 Revolution versus evolution 


Digital transformation: human evolution, not technological revolution. 


— Richard Mullins (2017:1) 


Big? Smart? Clean? Messy? — Christof Schöch (2013:2) 


Given these events or milestones, whether “[knowingly] or unknowingly, with every Google 
search, every Facebook post, and ... every time we simply turn on our smartphones ... we 
produce metadata” (Richards & King 2014:402), or data about that data (Schöch 2013:3), 
which in turn has resulted in the creation of “a kind of big metadata computer” (Richards & 
King 2014:403) as it were. However, we need to pause momentarily and ask ourselves, Are 
we experiencing a big data revolution? In her essay on the lessons we can glean from the era 
of big data, Ambrose (2015) casts doubt on big data constituting a revolution. Instead, she 
argues that what we are currently seeing is a revolution as it pertains to the sheer volumes of 
data we are currently faced with: “big data is not the revolution — big data is the avalanche 
of numbers which was intimately intertwined with the probabilistic revolution” (Ambrose 
2015:204) referred to earlier on. That the big data transition is emergent is echoed not only 
by Steven Finlay (2014:17), who contends that “[big data is] a process of evolution not 
revolution”, but also by Lazar, Kennedy, King and Vespignani (2014:1203), who suggest 
that we should focus not on a big data revolution, but “on an ‘all data revolution’, where we 
recognize that the critical change in the world has been innovative analytics, using data from 
all traditional and new sources, and providing a deeper, clearer understanding of the world”. 
This latter statement will no doubt constitute good news to scholars in the humanities and 
social sciences who are accustomed to working with traditional small data which “often offer 
information that is not contained (or containable) in big data” (Lazar et al. 2014:1205). 


1.5.2 The past in big data 


It is by acknowledging the long history of the accumulation of data about individuals 
and populations that we can begin to make a departure into seeing the different ways 
that data are presented in conceptual terms. — David Beer (2016a:4) 


What the brief history in this chapter also tells us is that the notion that big data has emerged 
out of a vacuum is a common misconception (O'Sullivan 2017:4): “the sense that we are 
being faced with a deluge of data about people is not something that is entirely new” (Beer 
2016a:2), given “the great explosion of numbers that ... occurred during the 1820s and 1830s” 
(Porter 1986:11), for example. As Kaplan and di Lenardo (2017:1) so succinctly put it, 
“history is punctuated by several Big Data moments which are characterized by a widespread, 
shared sense of information overload alongside rapid societal acceleration accompanied by 
the invention of new intellectual technologies”. Yet for some big data scholars, the history 
behind big data is unfortunately not particularly important. In a cautionary essay entitled 
‘Big data, little history’, Trevor Barnes (2013) passionately criticises individuals who believe 
that history is disconnected from a petabyte world. Barnes (2013:298) specifically targets 
individuals such as Chris Anderson when he argues that: 


... [In so far] as I can contribute to the discussion about big data, it is to 
caution that things are not as different as they might seem. It is not quite 
the brand new day that Chris Anderson supposes. In Chris Anderson’s 
account, history drops out. It might be big data, but it is little history. This 
neglect of history is a typical modernist move. The past is ignored because 
nothing must constrain or limit what is to come. Only the bright new 
future matters. ‘History is more or less bunk’ as the iconic modernist Henry 
Ford once famously put it. 


What lies at the crux of Barnes’ (2013) message is that the problems and challenges 
that accompanied the deluge of numbers in the past have not disappeared — that any scholar 
who chooses to explore big data needs to be mindful not only of the history of a particular 
phenomenon, but also of the methodological pitfalls experienced by past researchers. 
Kitchin (2014a:136) presents an example from the field of social physics to illustrate what 
can occur if history is ignored. Some scholars in this field have exploited big data analysis 
to draw conclusions about social and spatial processes as they occur in cities (Bettencourt, 
Lobo, Helbing, Kühnert & West 2007). Unfortunately, these scholars have “often wilfully 
[ignored] a couple of centuries of social science scholarship ... the result [being] an analysis 
... that is largely reductionist, functionalist and ignores the effects of culture, politics, policy, 
governance and capital, and a rich tradition of work that has sought to understand how 
cities operate socially, culturally, politically, and economically” (Kitchin 2014a:136). These 
scholars have therefore replicated the limitations inherent in the studies of social scientists of 
the mid-twentieth century (Kitchin 2014a:136). 
1.5.3 Two foundational narratives 


[Two] narratives contribute to structuring a multifaceted definition of the bigness of 
Big Data. — Frédéric Kaplan and Isabella di Lenardo (2017:1) 


A useful point of departure might be to consider that the history of big data points to 
two significant foundational narratives, namely, the so-called data deluge narrative and 
the big science narrative (Kaplan & di Lenardo 2017:1). The narrative behind the big data 
deluge is, as we have seen, that big data has originated out of the limitless possibilities 
offered up by the Internet and online communication networks (Kaplan & di Lenardo 
2017:1), and that we are now attempting what Mayer-Schönberger and Cukier (2013:73) 
have coined datafication — the process of transforming vast quantities of real-time digitised 
data into “structured knowledge systems that document the lives of people, companies and 
institutions, aggregating information about places, topics, or events” (Kaplan & di Lenardo 
2017:2). The big science narrative is an older narrative that tells the story of how researchers 
in diverse fields have been compelled to devise new methodologies to manage huge scientific 
archives known collectively as big science (Kaplan & di Lenardo 2017:2). 

What makes these narratives insightful is that they point to what Kaplan and di 
Lenardo (2017:1) refer to as “an epochal paradigm shift”. Basing his thinking on that of 
Kuhn (1996), Kitchin (2014b:3) observes that “[periodically] ... a new way of thinking 
emerges that challenges accepted theories and approaches”: as a “disruptive innovation” 
(Kitchin 2014b:10), big data is certainly compelling scholars in the (digital) humanities 
and (computational) social sciences to think about using alternative epistemologies in 
their research. However, the various shapes that these epistemologies will take in these 
disciplines are still disputed by scholars (Kitchin 2014b:3), and it is this contestation that 
is interrogated throughout this book. For now, it suffices to say that it is unlikely that the 
approaches behind big data — empiricism and data-driven science — will become substitutes 
for approaches employed in the traditional social sciences and the humanities, given their 
philosophical underpinnings (cf. Friese 2016:35). Kitchin (2014b:3) predicts that what will 
happen instead is that while “[big data] will enhance the suite of data available for analysis 
and enable new approaches and techniques ... [it] will not fully replace traditional small 
data studies”. This sentiment is shared by Danah boyd and Kate Crawford (2012:670), who 
contend that “it is increasingly important to recognize the value of ‘small data’. Research 
insights can be found at any level, including at very modest scales”. 

We turn now to a more detailed discussion of the role of big data in the (digital) 
humanities and (computational) social sciences before attempting to dispel some of the 
myths that scholars in the humanities and social sciences might harbour about employing 
big data in their fields of study. 
Locating big data in the (Digital) humanities 
and (Computational) social sciences 


In [the] humanities there [has] always been big data. — Amalia Levi (2013:33) 


The rise of big data ... represents a watershed moment for the social sciences. 
— Daniel McFarland, Kevin Lewis and Amir Goldberg (2016:12) 


The very notion of big data creeping into their research spaces casts an intimidating shadow 
over traditional humanists and social scientists, who may fear human behaviour being 
reduced to mere mathematical models (Schirrmacher 2015). Some humanists, for example, 
are of the view that the collection of vast amounts of quantitative data by digital humanists 
is akin to the loss of qualitative meaning (cf. Heuser & Le-Khac 2011:84), while social 
scientists argue that because big data approaches are essentially descriptive they will not 
be able to answer the why and how questions that pertain to the data they are interested in 
(Serfass, Nowak & Sherman 2017:341). 

Notwithstanding the fact that the humanities and social sciences have enjoyed a 
long association, they fundamentally differ in terms of their areas of focus and philosophical 
underpinnings (Kitchin 2014b:7). While scholars in the humanities use critical and 
analytical approaches to study the human condition or experience, those in the social 
sciences generally employ empirical methods to explore human behaviour, although other 
types of research such as theoretical, historical, analytical, and conceptual-philosophical 
research are also carried out (Punch 2013:3). The emergence of big data is now radically 
transforming the landscapes of both disciplines, but the “trajectory” (Kitchin 2014b:7) 
of big data epistemology in the two disciplines remains a topic of heated debate among 
scholars. In this chapter, we consider the literature and what it tells us about the nature and 
impact of big data on the epistemologies of the humanities and social sciences. We also look 
at the new and contested big data connections being forged in the digital humanities and 
computational social sciences. 
2.1 How big data is framed 


What the hell is big data anyway? — FabCom (2013:1)¹⁴ 


Torture the data, and it will confess to anything. — Ronald Coase¹⁵ 


Dominique Boullier (2016:3) correctly observes that when Mike Savage and Roger Burrows 
published ‘The coming crisis of empirical sociology’ in 2007, little did they know how 
prophetic their predictions would be. Amongst other things, they wrote, “sociologists have 
not adequately thought about the challenges posed to their expertise by the proliferation 
of ‘social’ transactional data which are now routinely collected, processed and analysed by 
a wide variety of private and public institutions” (Savage & Burrows 2007:885). The same 
could of course be said of scholars working in the humanities (Schöch 2013:2). Reflecting 
on their claims in a 2014 paper entitled ‘After the crisis?’, Savage and Burrows point out that 
“[what] must have read as new, innovative and important in 2007 ... now reads to us as a 
pretty mainstream position, not just in sociology but also across the cognate social sciences 
more generally” (Savage & Burrows 2014:1). However, we question how mainstream many 
humanists and social scientists find the big data deluge to be. In contrast to reactions from 
those in the digital humanities¹⁶ and computational social sciences, “the response seems 
mixed in the more traditional branches of the social sciences and humanities” (Puschmann 
& Burgess 2013:1691). Much of the big data critique centres around the absence of 
theoretical grounding, the decontextualisation of social phenomena (Puschmann & Burgess 
2013:1691), and the difficulties inherent in having to grapple with data analytics, concerns 
which are constantly acknowledged and addressed throughout this book. A discussion of 
these concerns makes more sense if we first consider how big data is framed in the mass 
media through specific (and seemingly harmless) metaphors because “when we approach any 
novel phenomenon, we begin to understand it first by metaphor. We start to make sense of 
the phenomenon by treating it as if it were like some other phenomenon with which we are 
more familiar” (Adamson & Bakeman 1982:224, cf. Penzold & Fischer 2017:2). Big data is 
framed in contradictory ways, reflecting both fear of it and excitement about the research 
possibilities it opens up to scholars (cf. Lupton 2015:2). 


14 https://www.fabcomlive.com/strategic-marketing-agency/wp-content/uploads/What-The-Hell-Big-Data-White-Paper.pdf. 


15 Nobel Prize winning economist. The date of this quotation is unknown, although Coase himself 
claims he said this in the 1960s. 


16 We would like to acknowledge the work being done by The South African Centre for Digital Language 
Resources (SADiLaR). On this platform, which is aimed at managing digital resources around 
language-related studies, the digital humanities programme makes use of state-of-the-art data-driven methods to 
carry out research in the humanities and social sciences (see 
https://www.thesouthafrican.com/lifestyle/government-launches-south-african-centre-for-digital-language-resources/). 
2.1.1 Two dominant metaphors 


... the people who get to impose their metaphors on the culture get to define what we 
consider to be true ... — George Lakoff and Mark Johnson (1980:160) 


In the previous chapter, we traced significant big data moments, particularly after the 
advent of the Internet, and noted that the media made use of statements pertaining to the 
“controlling [of] data volume” (Laney 2001:70), “the end of theory” (Anderson 2008:1), 
and to data as a “tsunami” (Black 2010:1), a “powerful natural resource” (Mike Rhodin in 
Groenfeldt 2012:1), a “gold rush” (Peters 2012:1), “the new oil” (Toonders 2014:1), and 
“big data-as-service” (Marr 2015b:1). Such pronouncements are significant as they highlight 
two metaphors Puschmann and Burgess (2014:1691) argue have become dominant, namely, 
that “big data is a force of nature to be controlled” and that “big data is nourishment/fuel 
to be consumed” (cf. Penzold & Fischer 2017:2). The message behind the first conceptual 
metaphor is that although big data is overwhelming, it can nevertheless be converted into a 
resource if effectively controlled. However, 


[data] is not a natural resource that replenishes itself, but ... is created by 
users with intentions entirely unrelated to its use as a valued commodity. 
It is created by humans and recorded by machines rather than being 
discovered and claimed by platform providers or third parties. At the same 
time, it is generally not used for the purpose for which it was collected. 
Its mass makes it easy to deliberately ignore individual items in favor of 
aggregate properties (Puschmann & Burgess 2014:1691). 


In addition, Puschmann and Burgess (2014:1699) point out that big data cannot be 
perceived as a value-neutral resource because in the first place, its value differs depending on 
whether we are referring to its ‘owners’,¹⁷ generators or collectors, and in the second, its value 
is borne out of analysis: it is not “inherent in some sort of natural form of consumption” 
(Puschmann & Burgess 2014:1699). The gold rush metaphor is a particularly odd one, 
since “[suggesting] that the intrinsic meaning of data is, like nuggets of gold, already there, 
just waiting to be uncovered, means distancing the interpretation from the interpreter and 
her subjectivity” (Puschmann & Burgess 2014:1699). This is problematic since scholars in 
the humanities and social sciences who employ qualitative methodology rely on subjective 
processes to understand social phenomena (Ratner 2002). 

The message behind the second metaphor is essentially that big data is a resource 
that needs to be consumed to secure survival and that it is, in a sense, a fuel that drives 


17 In Chapter 4, we will discuss the notion that information is owned by information empires such as 
Twitter, Apple, and Facebook because they trade in information generated by their users. 
businesses and institutions (Puschmann & Burgess 2014:1700). Again, it is a metaphor that 
is far from innocuous, since it conveys the notion that “the consumption of data strengthens 
the company or institution while requiring no or very little conscious interpretation or 
reflection” (Puschmann & Burgess 2014:1700; cf. Crawford, Miltner & Gray 2014:1669). 
Dawn Holmes (2017:20) adds that the framing of big data as oil is essentially a marketing tool 
exploited by data analytics to sell their products and services. She adds that “the metaphor 
only holds so far. Once you strike oil you have a marketable commodity. Not so with big 
data; unless you have the right data you can produce nothing of value” (Holmes 2017:20). 

These two metaphors reflect a conflict between two paradigms — the big data 
paradigm in which data is seen as neutral, existing independently of context (Kitchin 
2014a:145) and “an older [paradigm]” (Puschmann & Burgess 2014:1702) in which data 
is perceived to be socially constructed. 

In marked contrast to the second paradigm, the first suggests that meaning surpasses 
context, “that anyone with a reasonable understanding of statistics should be able to interpret 
[data] without context or domain-specific knowledge” (Kitchin 2014b:2). Kitchin (2014b:5) 
calls this “a conceit” on the part of some data scientists who are now undertaking social 
science and humanities research, sometimes without regard to the subject matter expertise 
required in these fields. This lament has also been voiced by data scientist Jake Porway 
(2013:2), who remarks that “[as] data scientists, we are well equipped to explain the ‘what’ 
of data, but rarely should we touch the question of ‘why’ on matters we are not experts in”. 

On the subject of conceit, qualitative researcher Annette Markham (2013:1) notes 
that she was at one stage asked the following questions by a big data analyst: 


How can we make qualitative research more important in the arena of big 
data? If big data is the purview of quantitative and computational analysts 
and qualitative researchers do not want to be left behind, how can they 
better inform big data research and researchers? 


Markham (2013) understandably goes on to state that these questions troubled her 
since they are based on two faulty premises. With regard to the first mistaken assumption — 
that it is only computational analysts and quantitative researchers who use big data — data 
in the humanities and the social sciences can also be big (cf. Levi 2013:33). However, as we 
will see in Chapter 3, (1) “[it is] not just about [the] size of the data” (Lazar et al. 2014:1204) 
and (2) size is in any case only one of several defining dimensions of big data (Gandomi 
& Haider 2015:143). The second erroneous supposition reflected in the set of questions is 
that qualitative research has no place in big data, but a number of studies put paid to this 
notion. Kathy Mills (2017:15) observes that “[qualitative] researchers are well-positioned 
to generate research questions, and to select, curate, interpret and theorize big data away 
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from reductionist claims”, and provides many examples of instances in which qualitative 
research has been used to bolster big data studies. In the field of communication research, for 
instance, Christine Lohmeier (2014:75) argues that combining qualitative research methods 
with big data analytics yields fruitful results for illustrating the dynamics of large online 
communities. In the area of the sociology of technology and innovation, Snijders, Matzat & 
Reips (2012:2) are of the view that in order to understand the micro-processes that occur in 
social media networks, employing mathematical models is not sufficient and that analysts 
should also use social science theories to guide their research. Parker, Saundage and Lee 
(2011:6) recommend that social informaticians combine data analytics and qualitative 
content analysis to deepen their insights into how people appropriate social media discourse: 
“social informaticians require a qualitative research method which gives them the flexibility 
to allow their research questions and units of analysis to emerge (or change) inductively 
throughout the process”.¹⁸ 


2.1.2 A collaborative effort 


... analysis of big data is an interdisciplinary research, which requires experts in 
different fields to harvest the potential of big data. 
— Min Chen, Shiwen Mao and Yunhao Liu (2014:176) 


The above studies have been carried out by researchers who have a background in data 
analytics, which is certainly not the case for many humanists and social scientists, who, 
like Kevin Lewis (2015:3), might describe themselves as “statistically challenged, Python- 
ignorant digitalphobe[s]”. It is also not helpful when messages with a slightly threatening 
undertone are bandied about, such as Lev Manovich’s (2012:472) “[you had] better have 
knowledge of [big data analytics]” if you want to understand big social datasets. What is 
often overlooked is that many data analysts do not have any expertise in the humanities 
and social sciences, and “[without] subject matter experts available to articulate problems 
in advance, you get [poor] results ... Subject matter experts are doubly needed to assess 
the results of [a study], especially when you're dealing with sensitive data about human 
behavior” (Porway 2013:2). This conundrum is also touched on by Salah, Manovich, Salah 
and Chow (2013:411) in the context of analyses of social network sites or SNSs: “[social 
network sites] are studied so far either by social scientists, who lacked the necessary tools and 
expertise to conduct research on large-scale datasets, or by physicists who lacked the research 
goals of social scientists in exploring the SNSs for inquiries about social phenomena”. 

One solution to this dilemma has been offered by scholars such as Stockmann 
(2016:23), Mills (2017:9), and Ford (2014:1), who recommend that big data analysts come 


18 See Chapter 7 for a brief discussion of content analysis and big data. 
together with social scientists and humanists in order to overcome gaps in their knowledge 
and expertise. Ford’s (2014) contribution reflects an interesting narrative about how, as an 
ethnographer working on projects related to Wikipedia sources, she collaborated with data 
scientists Dave Musicant and Shilad Sen: 


... Dave and Shilad had the necessary skills and resources to extract over 
67 million source postings from about 3.5 million Wikipedia articles ... 
[and] I was able to contribute ideas about different ways of slicing the data 
in order to gain new insights. Dave and Shilad had access to sophisticated 
software and data processing tools for managing such a high volume of 
data, and I had the knowledge about Wikipedia practice that would inform 
some of the analyses that we chose to do on this data (Ford 2014:3). 


In doing this kind of interdisciplinary research, Ford (2014:2) discovered that her 
skills and those of the two data scientists complemented each other, while making collective 
discoveries about the data yielded far more fruitful results than attempting to work in 
isolation. More importantly, she notes that contrary to her pre-conceived notions as to 
how big data scientists go about conducting research, her colleagues actually preferred to 
follow an inductive approach to analysing the Wikipedia dataset, and were also systematic in 
challenging any initial assumptions they had made about the data (Ford 2014:2). 


2.2 Digital humanists and computational social scientists 


What is humanistic about visualisation? — Elyse Graham (2017:449) 


... [Mounting] evidence suggests that many of the forecasts and analyses being 
produced [by powerful computational resources] misrepresent the real world. 
— Derek Ruths and Jürgen Pfeffer (2014:1063) 


Encouraging humanists/social scientists to collaborate with digital humanists/computational 
social scientists, on the other hand, might prove to be more difficult. If we look for a moment 
only at the picture in the digital humanities, it appears that traditional humanists feel 
obligated to keep justifying their research activities, criticising digital humanists for bowing 
to the pressure to prove that their research has utilitarian value, given that many institutions 
of higher learning have become corporatised. In this regard, a number of South African 
universities too appear to have surrendered to corporatisation (Clare & Sivil 2014:60). In 
“The dark side of the humanities”, Richard Grusin (2014:87) goes so far as to contend “that 
it is no coincidence that the digital humanities has emerged as ‘the next big thing’ at the same 
moment that the neoliberalisation and corporatisation of higher education has intensified 
in the first decades of the 21st century”. Grusin (2014:87) also identifies another serious 
tension that exists owing to the notion that some humanists hold that digital humanists 
tend “to ‘make things’ rather than ... critically comment on issues”. In this respect, one 
of the more contentious issues for scholars in the literary field is that digital humanists 
use statistical processing and data aggregation techniques to analyse massive amounts of 
data to ‘read’ a text, which is a process referred to as distant reading (Moretti 2005), as 
opposed to hermeneutic close reading. Advocates of the former kind of reading assert that 
machine reading helps uncover aspects of a text (related to vocabulary, genre, and themes, for 
example) on a scale that is impossible for human beings to achieve. They also contend that 
visualisation techniques, which entail the graphical display of aspects of a text in the shape 
of maps, tag clouds or graphs to name a few, assist them in making sense of that text (Jockers 
2013; Jänicke, Franzini, Cheema & Scheuermann 2015) and that big data can significantly 
improve on these visualisation techniques. Visualisation techniques are certainly not new as 
noted in Chapter 1 when we referred to John Snow’s and Florence Nightingale’s maps and 
graphs, respectively. In defense of (big) data visualisation in the digital humanities, Graham 
(2017:450) argues that traditional humanists mistakenly assume that data visualisation is 
merely a “discovery tool” when it actually “serves ... to refine arguments already made or 
illustrate conclusions already drawn”. Similar to the recommendation made by Ford (2014), 
Stockmann (2016), and Mills (2017) for team-work between traditional humanists and 
data scientists, Graham (2017:455) calls for closer collaboration between literary scholars 
and digital humanists, although, interestingly, this kind of suggestion is frowned on by 
some scholars. Urszula Pawlicka (2017:456), for instance, argues that collaboration “is 
less a development arising from new technologies than a response from within humanities 
departments to ‘neo-economic’ forces that have driven humanities departments to a point 
of crisis”. Nevertheless, a review of the literature shows fruitful collaboration between 
literary scholars and digital humanists in a few instances, particularly when it comes to 
using visualisation to conduct both close and distant readings of texts (cf. Wang Baldonado, 
Woodruff & Kuchinsky 2000) (see Chapter 7). 
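
As a rough illustration of the aggregation that underlies distant reading and visual forms such as tag clouds, the following Python sketch counts recurring vocabulary across a set of texts. It is a minimal sketch only: the three passages are invented toy strings, the stopword list and the names `term_frequencies` and `top_terms` are assumptions made for this example, and the studies cited above rely on considerably richer techniques.

```python
import re
from collections import Counter

# A deliberately tiny stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "was", "it"}


def term_frequencies(text):
    """Count the non-stopword terms in a single text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def top_terms(corpus, n=3):
    """Aggregate term counts over all texts and return the n most frequent."""
    totals = Counter()
    for text in corpus.values():
        totals.update(term_frequencies(text))
    return totals.most_common(n)


# Invented toy passages, standing in for a much larger collection of texts.
corpus = {
    "doc_a": "The sea was calm and the sea was grey.",
    "doc_b": "A storm rose over the sea, and the ship turned to the harbour.",
    "doc_c": "The ship reached the harbour in the storm.",
}

print(top_terms(corpus, n=1))
```

Counts of this kind are what a tag cloud would display; deciding what the prominence of a term means remains an interpretive task for the scholar.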

Karin van Es and Mirko Schäfer (2017) suggest two possible solutions to the 
concerns traditional humanists have when it comes to the digital humanities. First, they 
assert that these scholars should not lose sight of the fact that the digital humanities is not 
a new field — that the term “is merely the nom de guerre of the computational turn in the 
humanities”, which is another reminder not to be swept up in the hype of the big data 
evolution (van Es & Schäfer 2017:15). Indeed, these scholars anticipate that rather than 
replacing existing methodologies in the traditional humanities, the digital humanities will 
simply expand traditional humanists’ current methods, which is a very similar prediction to 
the one made by Kitchin (2014b:10) that big data methods will not become a substitute 
for the methodologies employed in the humanities and social sciences. Second, traditional 
humanists should embrace the opportunities the digital humanities has to offer with a view 
to studying society based on large quantities of data: 


Because datafication is taking place at the core of our culture and social 
organization, it is crucial that humanities scholars tackle questions about 
how this process affects our understanding and documentation of history, 
forms of social interaction and organization, political developments, and 
our understanding of democracy (van Es & Schäfer 2017:14). 


Proponents of digital humanities are divided into those who still follow the first 
wave of digital humanities (cf. Schnapp, Presner & Lunenfeld 2009), which (from the late 
1990s to the early 2000s) was characterised by quantitative research, the belief being that the 
use of techniques such as textual mining, graphing, and mapping results in “methodological 
rigour and objectivity” (Kitchin 2014b:7-8), and those in the second wave, who assert 
that these kinds of techniques should only supplement traditional humanities methods 
(Kitchin 2014b:8). Individuals in the first camp are criticised for being mechanistic, 
encouraging superficial analysis “rather than deep, penetrating insight” (Kitchin 2014b:8) 
and unfortunately, as Kitchin (2014b:8) points out, computational social scientists are also 
sometimes guilty of reductionist techniques. By way of illustration, he describes a study in 
which a group of researchers mapped millions of tweets over a three-year period to determine 
which geographical areas in a city were more multilingual than others (Rogers 2013). The 
map became “the end-point” (Kitchin 2014b:6), since the researchers did not go beyond 
this preliminary stage to utilise social theory and consider context with a view to answering 
pertinent questions related to people's social and economic backgrounds, for example. 

However, the fact that big data is being utilised by social scientists cannot be ignored. 
Youtie, Porter and Huang (2017:65) correctly point out that little research has been carried 
out to determine the role that social scientists (and humanists) are beginning to play in 
informing the emergence of big data technologies. Drawing on 488 research papers, these 
scholars conducted a comprehensive review of early social science and humanities research 
in the area of big data. They concluded that the majority of the papers drew on specific 
knowledge sources when researching big data, namely, the Internet and society, big data and 
medicine, law and privacy, and business impacts studies (Youtie et al. 2017:67). 

Political scientists, communication scholars, sociologists, and psychologists, to name 
a few, are beginning to consider how computational tools may best be exploited to help them 
gain deeper insights into people and their interactions with one another (Shah, Cappella & 
Neuman 2015:6). When we look at the connection between big data and computational 
social science and at the challenges this entails, it might be a good idea to first define the 
latter in relation to the former. Shah et al. (2015:7) define the term as follows: 


It is an approach to social inquiry defined by (1) the use of large, complex 
datasets, often — though not always — measured in terabytes or petabytes; 
(2) the frequent involvement of “naturally occurring” social and digital 
media sources and other electronic databases; (3) the use of computational 
or algorithmic solutions to generate patterns and inferences from these 
data; and (4) the applicability to social theory in a variety of domains from 
the study of mass opinion to public health, from examinations of political 
events to social movements. 


As is the case for traditional social scientists, some computational social scientists 
remain sceptical about big data’s usefulness, while others have adopted an extreme view, 
arguing that the methodology behind big data is superior to that employed by computational 
social scientists. Proponents of such a view include Matthew Hindman (2015:48), who 
provocatively states that “[analytic] techniques developed for big data have much broader 
applications in the social sciences, outperforming standard regression models even” and 
Jimmy Lin (2015:33), who is of the view that “[if] the end goal of big data use is to engineer 
computational artifacts that are more effective according to well-defined metrics, then 
whatever improves those metrics should be exploited without prejudice”. Some scholars take 
a more moderate stance, such as Shah et al. (2015:9), who cast doubt on the notion that big 
data will entirely replace surveys, clinical trials, and analysis. 

Mirroring the trepidations of humanists and social science scholars unfamiliar with 
big data, computational social scientists’ concerns revolve around ethics, the subordination 
of theory to data, and the very real danger of data validity being compromised, given the 
oftentimes questionable representativeness of social data, concerns we interrogate more 
closely in Chapters 3 and 4. For geography scholar David O’Sullivan (2017:9), a greater 
problem — and one that Kitchin (2015a, 2015b) has brought up more than once — lies in how 
big data represents processes in computational social science. What he finds problematic is 
that huge datasets only create the impression of capturing dynamism, and that many big data 
scientists never actually explain the processes behind the changes that must surely be taking 
place in the data: “[process] and change are ... rendered as ‘one damn thing after another’ 
with no notion of process or mechanism in the data themselves” (O’Sullivan 2017:9). In his 
view, a bottom-up, emergentist approach (which is commonly associated with complexity 
science) rather than a top-down aggregate one is better suited to process and explanation.¹⁹ 
Table 2.1 illustrates the main differences between complexity science and big data: 


19 Digital humanities is associated with the former approach. (See Burdick, Drucker, Lunenfeld, Presner 
& Schnapp’s (2012) Digital_humanities in this regard.) 
Table 2.1: Two approaches to big data analysis (O’Sullivan 2017:15) 

Complexity science                             | Big data
Embeds theory in models                        | Correlation and classification
Represents process                             | Temporal snapshots
Open-ended exploration of process implications | Exploration of already-collected data
Bottom up                                      | Top down
Multiple levels and scales                     | Two levels: aggregate and individual
Many alternative histories (or futures)        | ‘Just the facts’ (or optimal solutions)


As the table illustrates, complexity science and big data appear to reflect divergent 
methodological perspectives and intellectual styles in the context of the digital humanities 
and computational social sciences. Yet it would be a mistake to conclude that the two 
traditions are worlds apart. Indeed, “[both] are about fitting simple models to observation” 
(O'Sullivan 2017:17), while complexity science may complement big data. Of course this 
does not mean that complexity science is the answer to the various questions posed by 
researchers: “[just] as it is foolish to believe that data-mining Big Data can provide answers 
to every social science [and humanist] question, it would be foolish to argue that simple 
complexity science models can answer every question” (O’Sullivan 2017:18). 

We hold the view that complexity theory is of value to both (digital) humanists 
and (computational) social scientists, given that it helps foster a deeper understanding of 
human and social phenomena. In this respect, Youngman and Hadzikadic’s (2014) scholarly 
publication Complexity and the human experience discourses at length on quantitative 
analyses currently being undertaken in the humanities and social sciences. Their work 
may be regarded as ground-breaking since it applies the principles of complexity theory 
to research fields we would not generally associate with it. These fields include literature, 
anthropology, political science, and sociology. However, we also acknowledge the value of a 
qualitative approach to the analysis of big data as several chapters in this book reveal. Before 
doing so, we first turn to uncovering some of the common misperceptions about big data 
analysis that persist in the humanities and social sciences. 
Big Data, big despair: 
Myths debunked and lessons learned 


The temptation to form premature theories upon insufficient data is the bane of our 
profession. — Sherlock Holmes²⁰ 


The misconceptions that surround big data are numerous, which is perhaps not surprising 
given the fact that scholars unfamiliar with big data are regularly confronted by messages 
such as ‘Volume is all that truly counts’ (cf. Jagadish 2015), ‘Big data does not need theory’ 
(cf. Bowker 2014), and ‘Big data is not generally associated with qualitative research methods’ 
(cf. Chen & Zhou 2017). On the other hand, scholars are also bombarded by messages that 
contradict these admonitions. It is somewhat confusing for researchers to also encounter 
claims such as ‘Small data sources can become big data sources’ (cf. Ferguson, Nielson, Cragin, 
Bandrowski & Martone 2014) and ‘Qualitative research methods may be employed to analyse 
large datasets’ (cf. Housley, Dicks, Henwood & Smith 2017; Mills 2017). 


3.1 Epic fails 


Data, in the wrong hands, whether malicious, manipulative or naive can be 
downright dangerous. — Donald Clark (2013:1) 


Globally, the term ‘big data’ has a tendency to engender fear, mistrust and — perhaps most 
worrisome of all — inertia among many humanists and social scientists, and the situation 
appears to be no different in South Africa. At the time this chapter was drafted, an Internet 
search of what South African universities are doing to prepare humanists and social scientists 
for the era of big data yielded very little information.²¹ What appears to partially drive 
negative attitudes towards big data has to do with a number of myths surrounding this meme, 
myths which we aim to explore and debunk in and across several chapters in this book. At 
the risk of sounding contradictory, we would like to invite humanists and social scientists 
to be sceptical about big data — at least for now. Scepticism is not an unhealthy inclination 


20 Arthur Conan Doyle (1915), The valley of fear (George H. Doran Company). 

21 In South Africa at least, big data analysts appear to be mainly big data scientists who have expertise 
in new statistical and programming skills. Lack of computational skills may be one reason why social 
scientists and humanists have been slow to tap into the benefits of big data theory. 
in the face of the big data deluge (Davenport 2014:2). Indeed, “big-data related efforts 
have had as many failures as they have [had] successes” (Schilling & Bozic 2014:3272). 
To date, Google Flu Trends or GFT is arguably the most well-known big-data foible of the 
millennium (Lazar, Kennedy, King & Vespignani 2014:1203). Attempting to accurately 
forecast the outbreak of influenza in some 25-odd countries from 2008 onwards, this web 
service was 140 percent off in its predictions in 2013 due to flawed algorithmic dynamics.²² 
In addition to human error and misinterpretation of data, GFT did not take into account 
that online users searching for ‘cough’, ‘headache’ or ‘fever’, for instance, might have been 
looking for information on topics unrelated to influenza. In this chapter, 
we interrogate what can be mined (pun intended) from the GFT failure and others as well 
as from the misconceptions surrounding big data, centering our discussion on how various 
lessons may be exploited by social scientists and humanists carrying out big data research. 
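
The risk of mistaking coincidental patterns for predictive ones can be made concrete with a small simulation. The Python sketch below is not a reconstruction of GFT: it assumes purely random "search term" series and a random target series, and the function names and sample sizes are chosen for illustration only. It shows that when many candidate series are screened against only a few observations, at least one of them is likely to correlate strongly with the target by chance alone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_observations, n_candidates, seed=0):
    """Strongest absolute correlation between a random target series and
    n_candidates unrelated random series of the same length."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_observations)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_observations)]
        best = max(best, abs(pearson(candidate, target)))
    return best


# Same number of candidate series, very different numbers of observations.
few_cases = best_spurious_correlation(n_observations=10, n_candidates=2000)
many_cases = best_spurious_correlation(n_observations=500, n_candidates=2000)

print(f"best match with 10 observations:  {few_cases:.2f}")
print(f"best match with 500 observations: {many_cases:.2f}")
```

With only ten observations, the best of the random series appears strongly correlated with the target even though nothing connects them, whereas with five hundred observations the best match is weak. The correlation alone cannot tell a researcher which situation applies; that judgement requires knowledge of the subject matter.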


3.2 Big data lessons 


3.2.1 Lesson 1 
With big data comes big noise. — Beth Mole (2015:1) 


Above all else, assess data quality. 


In “The parable of Google Flu”, Lazar et al. (2014) caution against what they refer to as 
big data hubris, a term they employ to describe the penchant among some analysts to 
entirely replace traditional data collection and data analysis methods with big data. Relying 
exclusively on massive quantities of data generated by Internet services, what developers of 
the first version of GFT failed to do was to employ instruments aimed at establishing valid 
and reliable data: 


The odds of finding search terms that match the propensity of the flu but 
are structurally unrelated, and so do not predict the future, were quite 
high. GFT developers, in fact, report weeding out seasonal search terms 
unrelated to the flu but strongly correlated to the CDC [Centers for Disease 
Control and Prevention] data, such as those regarding high school basketball ... (Lazar et 
al. 2014:1203). 


22 In fact, “the algorithm has been criticised for overfitting a small number of cases and masking a 
simple question, namely, does it predict flu or is it merely reflecting the incidence of winter...” 
(Agarwal & Dhar 2014:446). 
They add that “[t]his should have been a warning that the big data were overfitting 
the small number of cases, a standard concern in data analysis” (Lazar et al. 2014:1203). 
The first lesson for social scientists and humanists? To be of value to scholarship, any 
big data collected and analysed should be accurate and of high quality. Is this easier said 
than done? To answer this question, we reviewed a study by Cai and Zhu (2015) who 
have proposed what we perceive to be a useful data quality framework and accompanying 
dynamic assessment process to assess the quality of data. Any scholar new to big data needs 
to first be aware that big data cannot be simplistically defined as data that is too large to 
fit into an Excel spreadsheet, for example (Kitchin 2014b:1). Rather, big datasets reflect 
distinct features or vectors commonly referred to as the five Vs,²³ namely, volume, velocity, 
variety, veracity, and value (Katal, Wazid & Goudar 2013). Volume is self-explanatory, 
whether it is measured in the number of records, the amount of storage space it requires 
or the completeness of the dataset. Velocity has to do with the need to analyse big datasets 
in a timely fashion, since data is produced at breakneck speed and changes just as quickly. 
A defining and third feature of big data has to do with the variety of sources and content 
available to researchers, and the data are in turn divided into structured, semi-structured, 
and unstructured data demanding sophisticated data processing capabilities (see Chapter 8). 
Veracity refers to the trustworthiness of the data collected and analysed, and finally, value 
refers to the worth of the data, since huge amounts of data have no value if worthwhile 
information and knowledge cannot be extracted from them. Being aware of these features 
is crucial if data quality is to be accurately assessed.²⁴ With these vectors in mind, Cai and 
Zhu (2015)²⁵ suggest that analysts use their hierarchical data quality framework illustrated 
in Table 3.1. 


23 We only touch on big data’s five Vs here; Chapter 8 provides a more detailed, technical account of 
the technologies required to manage huge volumes of data. 


24 Some users of big data have added additional Vs depending on their research objectives, and these 
are visualisation, variability, and value (see Chen & Zhang (2014) for a definition and explanation 
of each vector). Some scholars have also added exhaustivity which demands that the data collected 
represent the entire population under investigation (Kitchin & Lauriault 2015:464). 


25 Of interest is the observation by some big data analysts that researchers favour specific dimensions 
over others, depending on their disciplines. Researchers who focus on the Internet, for example, 
emphasise velocity, while humanists and social scientists highlight value and veracity. In this respect, 
see Hitzler & Janowicz (2013). 
Table 3.1: Cai and Zhu’s (2015:5) big data quality framework 

Dimensions           | Elements      | Indicators
Availability         | Accessibility | Whether a data access interface is provided
                     |               | Data can be easily made public or easy to purchase
                     | Timeliness    | Within a given time, whether the data arrive on time
                     |               | Whether data are regularly updated
                     |               | Whether the time interval from data collection and processing to release meets requirements
Usability            | Credibility   | Data come from specialized organisations of a country, field, or industry
                     |               | Experts or specialists regularly audit and check the correctness of the data content
                     |               | Data exist in the range of known or acceptable values
Reliability          | Accuracy      | Data provided are accurate
                     |               | Data representation (or value) well reflects the true state of the source information
                     |               | Information (data) representation will not cause ambiguity
                     | Consistency   | After data have been processed, their concepts, value domains, and formats still match as before processing
                     |               | During a certain time, data remain consistent and verifiable
                     |               | Data and the data from other data sources are consistent or verifiable
                     | Integrity     | Data format is clear and meets the criteria
                     |               | Data are consistent with structural integrity
                     |               | Data are consistent with content integrity
                     | Completeness  | Whether the deficiency of a component will impact use of the data for data with multi-components
                     |               | Whether the deficiency of a component will impact data accuracy and integrity
Relevance            | Fitness       | The data collected do not completely match the theme, but they expound one aspect
                     |               | Most datasets retrieved are within the retrieval theme users need
                     |               | Information theme provides matches with users’ retrieval theme
Presentation quality | Readability   | Data (content, format, etc.) are clear and understandable
                     |               | It is easy to judge that the data provided meet needs
                     |               | Data description, classification, and coding content satisfy specification and are easy to understand
The data quality assessment process itself involves following a number of specific steps 
outlined in Figure 3.1. 


[Figure 3.1 (flowchart, not reproduced). Elements shown: goals of data collecting; determining quality dimensions and elements; determining indicators; formulating evaluation baseline; data collection; data cleaning; data quality assessment, with a check against the baseline; generating data quality report; output data; data analysis and data mining, with a check against the goals; output results.] 

Figure 3.1: Cai and Zhu’s (2015:7) quality assessment process 
Although both the framework and process were specifically designed for commercial 
purposes, we argue that they may be adapted to suit different research environments in the 
sense that academics may simply select data quality dimensions that will help them achieve 
their research objectives. Let us consider a scenario in which social scientists would like to 
explore the attributes of civil and uncivil discourse online. To achieve this goal, they decide to 
collect posts generated by commenters on online news sites. When it comes to determining 
quality dimensions, the need for them to take timeliness, accuracy, and completeness into 
account will obviously be prioritised (Cai & Zhu 2015:7). Since social media data is raw 
data, it will also be necessary to assess its credibility (Cai & Zhu 2015:7). Dimensions such 
as consistency and integrity will not be particularly useful to them, since social media data 
is generally unstructured (Cai & Zhu 2015:7). The next step will be to select indicators for 
every dimension chosen. Thus, for example, in terms of timeliness and completeness, the 
researchers will have to be aware that commenters’ posts on online news sites are regularly 
updated, and that they will therefore have to make sure that they access and download 
the latest posts so that their dataset is complete. They will also have to ensure that they do 
not miss the window of opportunity for downloading data, since online news outlets may 
close down comments sections at specific times and without prior warning. Assessing the 
quality of each dimension will allow the researchers to determine if they are satisfied with the 
baseline standard. If the baseline standard is satisfactory, they will draft a quality report and 
then enter the data acquisition phase during which the data is scrubbed or cleaned with the 
aim of “[detecting] and [removing] errors and inconsistencies from data in order to improve 
their quality” (Cai & Zhu 2015:7) (see Chapter 7 as well). A little later on in this chapter, 
we more closely consider how the use of big data complicates the research process when it 
comes to assessing data quality, but for now we assume that the researchers are ready to analyse
their data, which brings us to the next lesson.
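
The scenario above can be made concrete with a minimal sketch in Python. It assumes a hypothetical list of scraped comments, each with an id, a posting time and a text field, and it checks two of the dimensions mentioned: completeness (the share of records with non-empty text) and timeliness (whether the newest comment falls within an acceptable lag of the download time). The field names, the threshold values and the function assess_quality are illustrative assumptions and are not part of Cai and Zhu's (2015) framework.

```python
from datetime import datetime, timedelta

# Hypothetical baseline chosen by the researchers (assumed values).
BASELINE = {"min_completeness": 0.9, "max_lag": timedelta(hours=6)}


def assess_quality(comments, downloaded_at, baseline=BASELINE):
    """Return a small quality report for a list of scraped comments."""
    total = len(comments)
    # Completeness indicator: share of records whose text is not empty.
    filled = sum(1 for c in comments if (c.get("text") or "").strip())
    completeness = filled / total if total else 0.0
    # Timeliness indicator: gap between the newest post and the download.
    newest = max((c["posted"] for c in comments), default=None)
    lag = downloaded_at - newest if newest is not None else None
    report = {
        "records": total,
        "completeness": completeness,
        "lag": lag,
        "complete_enough": completeness >= baseline["min_completeness"],
        "timely": lag is not None and lag <= baseline["max_lag"],
    }
    # The baseline is met only if every chosen dimension is satisfactory.
    report["meets_baseline"] = report["complete_enough"] and report["timely"]
    return report


# Toy data, invented purely for illustration.
now = datetime(2019, 3, 1, 12, 0)
sample = [
    {"id": 1, "posted": now - timedelta(hours=2), "text": "I disagree."},
    {"id": 2, "posted": now - timedelta(hours=5), "text": ""},
    {"id": 3, "posted": now - timedelta(hours=1), "text": "Well argued."},
]
print(assess_quality(sample, downloaded_at=now))
```

In this toy case the data are timely but one of the three records is empty, so the assumed completeness baseline is not met and the researchers would return to data collection rather than proceed to analysis.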


3.2.2 Lesson 2 


[In the age of big data] [q]ualitative research can illuminate questions that need
asking and real problems that need solving. — Elizabeth Kaufer (2016:1) 


The growth of quantitative technique may be counterbalanced by qualitative 
understanding. — Joshua Fairfield and Hannah Shtein (2014:49) 


Think qualitatively (too). 


Probably one of the most important lessons for social scientists and humanists is that the birth 
of big data is not, in fact, the end of the scientific method, a provocative pronouncement made 
by Chris Anderson in 2008 when he was the editor-in-chief of Wired. Anderson (2008:5)
claimed, amongst other things, that “[p]etabytes allow us to say: ‘Correlation is enough.’
We can stop looking for models. We can analyze the data without hypotheses about what it 
might show. We can throw the numbers into the biggest computing clusters the world has 
ever seen and let statistical algorithms find patterns where science cannot”. In a tongue-in-cheek
response to the end of theory declaration, Geoffrey Bowker (2014:1797-1798) puts
it aptly when he observes that “this is a massive reduction of what it means to ‘know’” and
that “any ‘thing’ we create (object, way of looking at the world) embodies theory and data”.
Interestingly, the notion that big data is the re-birth of empiricism is one that anti-big-data 
researchers cannot seem to shake, yet the paradox lies therein that “it is not just empirical- 
quantitative” (Cope & Kalantzis 2015:226) in nature. In fact, scholars have echoed Bowker’s 
(2014) sentiments, reiterating that “[big data] demands more conceptual, theoretical, 
interpretative, hermeneutical — indeed qualitative — intellectual work than ever” (Cope & 
Kalantzis 2015:227). 

The importance of qualitative thinking is echoed by a number of other researchers 
who are interested in exploring big data. Two of these are boyd and Crawford (2012:670), 
who argue that “there remains a mistaken belief that qualitative researchers are in the business 
of interpreting stories and quantitative researchers are in the business of producing facts”. 
They offer the caveat that such narrow thinking means that “[big data] risks reinscribing 
established divisions in the long running debates about scientific method and the legitimacy 
of social sciences and humanistic inquiry” (boyd & Crawford 2012:67). Mazzocchi
(2015:1253) also expresses it well when he maintains that 


[scientific] research does not take place in a purely theoretical and rational 
environment of facts, experiments and numbers. It is carried out by 
human beings whose cognitive stance has been formed by many years of 
incorporating and developing cultural, social, rational, disciplinary ideas, 
preconceptions and values, together with practical knowledge. 


Where does the notion come from that big data makes theoretical assumptions and 
hypotheses redundant? We know that this is certainly not a new idea, and that it harks back 
to Francis Bacon’s (1620) Novum organum in which he subordinated the testing of theory 
to observation and analysis. Isaac Newton (1713) too was unimpressed by theory, famously 
remarking “hypotheses non fingo” (“I frame no hypothesis”) in the second edition of the 
Principia. In the last few years, some advocates of big data have argued quite convincingly 
in favour of correlation being superior to causation, notably Mayer-Schönberger and Cukier
(2013) in their book entitled Big Data: A revolution that will transform how we live, work, and
think (Houghton Mifflin Harcourt). Mayer-Schönberger and Cukier (2013:14) maintain
that “correlations may not tell us precisely why something is happening, but they alert us
that it is happening. And in many situations this is good enough”. We support Mazzocchi’s
(2015:1252) view that the “no theory” thesis is not good enough for academics — that 
“understanding the why is crucial for reaching a level of knowledge that can be used with 
confidence for practical applications and for making predictions”. As social scientists or 
humanists, we thus need to analyse big data within specific theoretical and methodological 
limitations in order “to assign them a meaning and to distinguish between meaningful and 
spurious correlations” (Mazzocchi 2015:1252). 

We would not want to create the impression that the majority of big data advocates are 
anti-theory, however. A number of researchers have called for big data projects to be driven or 
enhanced by theory (Frické 2013; Coveney, Dougherty & Highfield 2016; Sparks, Ickowicz 
& Lenz 2016; Wu, Buyya & Ramamohanarao 2016; Olshannikova, Olsson, Huhtamaki & 
Kärkkäinen 2017). Some researchers in the social sciences, notably, Monroe, Pan, Roberts, 
Sen and Sinclair (2015:71), maintain that big data and social science research methods are 
not entirely unsuited, and that big data may enhance these methods and “[enable] us to 
answer new questions”. In fact, they go on to argue that “it is the responsibility of social
scientists to assume their central place in the world of big data” (Monroe et al. 2015:71) 
because a great deal of big data reflects social data. How much data is sufficient in the world 
of social scientists though? 


3.2.3 Lesson 3 


More data do not necessarily generate more knowledge. 


— Fulvio Mazzocchi (2015:1253) 


Be mindful that size is relative. 


From a computational point of view, big data is information that is so large it can only be 
retrieved and organised with the assistance of special tools discussed in Chapter 8. However, 
and here we debunk yet another myth, “from a social sciences perspective big data is not always 
that big” (Beneito-Montagut 2017:915). What is important is not size, but rather accessing 
“more data, quicker and richer than before” (Beneito-Montagut 2017:915). What scholars 
should bear in mind is that the size or volume of data differs from industry to industry, 
and is therefore not the defining characteristic of big data (Gandomi & Haider 2015:143). 
Indeed, in the context of the humanities or social sciences, “[big data] can safely be reduced 
to medium-size data and still yield valid and reliable results” (Mahrt & Scharkow 2013:28), 
provided that validity is achieved when it comes to the sampling process. Of course, the 
immediate question then becomes, But what about the generalisability of one’s findings? It may 
be surprising to academics working in the social sciences and in the humanities to learn that 
more data from a particular source is not needed to achieve generalisability. What is required
is “horizontal expansion” (Mahrt & Scharkow 2013:26) or more data from multiple data
sources (Mahrt & Scharkow 2013:26). To understand what this means in practical terms, 
we briefly describe a research study by Gray, Jennings, Farrall and Hay (2015) significantly 
entitled “Big small data”. 

Exploring social and economic change at a national level in Britain, Gray and her 
team (either criminologists or political scientists) were mindful of the need to make use of 
large datasets while staying true to “the essential social and cultural heredity that is intrinsic 
to the human sciences” (Gray et al. 2015:1). One of their preliminary findings in the context 
of economy and crime rates was that property crime spiked when unemployment levels 
rose, which in turn increased citizens’ fear of crime as well as the government’s attention to 
that crime. What the researchers needed in order to examine attitudes towards other kinds 
of crime was an integrated model of analysis that would allow them to construct a series of 
connected, multi-layered datasets that would help them not only to observe attitudinal shifts 
in relation to specific crimes, but also to examine changes at aggregate levels over a 30-year 
period. Ultimately, the dataset comprised individual-level data such as victimisation and
social attitudes and aggregate-level data such as socio-economic indicators, official statistics 
on crime, public opinion data, and policy documents. Although not hailing big data as 
the magic bullet, Gray and her colleagues were able to harness the power of computational 
technology and statistical techniques to exploit the richness of high-volume datasets. 
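
The idea of connecting individual-level and aggregate-level data can be sketched in a few lines of Python. The example below is not a reconstruction of the model used by Gray et al. (2015); it simply assumes two hypothetical sources, a set of survey responses and a table of yearly indicators, and links them by year. All names and numbers are invented for illustration.

```python
from collections import defaultdict

# Invented toy records; real survey data and official statistics would replace these.
individual = [  # individual level: one row per survey respondent
    {"year": 1990, "respondent": "a", "worried_about_crime": True},
    {"year": 1990, "respondent": "b", "worried_about_crime": False},
    {"year": 1991, "respondent": "c", "worried_about_crime": True},
    {"year": 1991, "respondent": "d", "worried_about_crime": True},
]
aggregate = {  # aggregate level: one row per year (invented values)
    1990: {"unemployment_rate": 7.0},
    1991: {"unemployment_rate": 9.0},
}


def link_levels(individual_rows, aggregate_rows):
    """Summarise individual responses per year and attach aggregate indicators."""
    by_year = defaultdict(list)
    for row in individual_rows:
        by_year[row["year"]].append(row["worried_about_crime"])
    linked = {}
    for year, answers in sorted(by_year.items()):
        linked[year] = {
            "share_worried": sum(answers) / len(answers),
            # Years without aggregate data keep only the survey summary.
            **aggregate_rows.get(year, {}),
        }
    return linked


for year, row in link_levels(individual, aggregate).items():
    print(year, row)
```

The design choice worth noting is that the datasets stay separate until they are joined on a shared key, which is one simple way of expanding "horizontally" across sources rather than collecting more data from a single source.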

Given the need for computational skills and in light of the previous chapter’s focus 
on, amongst other things, the divide between social and data scientists, readers may at this 
stage be asking whether they too will be able to develop the necessary skill set to make use
of big data. This takes us to the next lesson which has to do with a radical shift in the way 
social scientists and humanists think about research. 


3.2.4 Lesson 4 


For the analysis of big data to truly yield answers to society’s biggest problems, we
must recognise that it is as much about social science as it is about computer science.
— Justin Grimmer (2015:80)


Develop specific skills to manage and analyse big data, but remember that these skills 
are tied to a number of caveats. 


One of the major implications of the big data evolution is that scholars working in the 
humanities and social sciences will require a fairly unique combination of skills that draw 
on the social sciences, computer science, and statistics (Miller 2011:1815). These skills are 
explored when we critically appraise big data software tools (in Chapters 7 and 8), but suffice 
it to say at this early stage, analysing big data involves thorough data profiling, “thoughtful 
measurement ... [as well as] careful research design, and the creative deployment of statistical
techniques” (Grimmer 2015:80). Not surprisingly, each of these phases comes with its own
set of pitfalls, which are discussed below.

When it comes to data profiling and measurement, political theorists John Patty and 
Elizabeth Penn (2015:100) remind scholars that “‘data’ is nothing until we use it to measure 
something, and ‘measurement’ is accordingly inherently theoretical”. Expressed a little 
differently, the decision as to what to describe and what to omit makes theory unavoidable 
(cf. Lemire & Petersson 2017). In their study of how novice data users made sense of large- 
scale datasets, Faniel, Kriesberg and Yakel (2012) found that one of the major concerns of 
social scientists measuring big data had to do with first putting their data in context because 
without it, data has no meaning (cf. Keim, Qu & Ma 2013:21). Putting data into context 
has to do, amongst other things, with profiling, which encompasses evaluating the content 
and quality of datasets. We touched upon assessing quality a little earlier on in this chapter, 
but in the age of big data, it is not simply a case of transferring data into a statistical package, 
describing the basic features of the data, and then summarising those features. More often 
than not, social scientists who attempt to obtain (or ‘scrape’) big datasets are confronted by 
reams of raw or unstructured data. More than a decade ago, Geoffrey C. Bowker (2005:183- 
184), an expert in informatics at the University of California, wrote that “[raw] data is both 
an oxymoron and a bad idea; to the contrary, data should be cooked with care”. Cooking the 
data is not without its challenges since the process is inescapably subjective in nature. Almost 
a decade ago, Bollier (2010:13) raised this concern when he asked “[c]an the data represent 
an ‘objective truth’ or is any interpretation necessarily biased by some subjective filter or the 
way that data is ‘cleaned’?”. Coupled to this subjectivity is the problem of what boyd and 
Crawford (2012:668) refer to as the tendency to practise apophenia — “seeing patterns where 
none actually exist”. They refer to physicist, mathematician, and computational scientist 
Leinweber (2007:1), who describes how, together with his colleagues, he was able to show 
that data mining techniques can easily be manipulated to produce ridiculous correlations in 
the financial world: 


The example in this paper is intended as a blatant example of [a] totally 
bogus application of data mining in finance. We first did this several years 
ago to make the point about the need to be aware of the risks of data 
mining in quantitative investing. In total disregard of common sense, we 
showed the strong statistical association between the annual changes in the 
[Standard & Poors] 500 stock index and butter production in Bangladesh, 
and other farm products. Reporters picked up on it, and it has found its 
way into the curriculum at Stanford Business School and elsewhere. We 
never published it since it was supposed to be a joke. 


An important lesson behind this hoax is that social scientists and humanities scholars
need to be explicit about their methodological processes, elucidating how the interpretation 
of their data is clouded by their biases. Boyd and Crawford (2012:668) suggest that one 
way to be accountable for bias is to “[recognize] that one’s identity and perspective [inform] 
one’s analysis” every step of the way, which is not an alien notion for anyone carrying out 
qualitative research. 
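
The risk of apophenia in data mining can be demonstrated with a short simulation. The sketch below does not use Leinweber's (2007) data; it assumes a hypothetical target series of ten yearly changes made of pure random noise, generates a thousand equally random candidate series, and reports the strongest correlation it can find. The function names, the series length and the number of candidates are assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_spurious_match(n_years=10, n_candidates=1000, seed=0):
    """Search unrelated noise series for the one that best 'explains' a target."""
    rng = random.Random(seed)
    # Stand-in for yearly changes in some index; it is random by construction.
    target = [rng.gauss(0, 1) for _ in range(n_years)]
    best_r = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_years)]  # pure noise
        r = pearson(target, candidate)
        if abs(r) > abs(best_r):
            best_r = r
    return best_r


r = best_spurious_match()
print(f"Strongest correlation found among 1000 noise series: {r:.2f}")
```

Because the search keeps only the best of many candidates, a strong correlation turns up even though every series is unrelated to the target by design. This is why a correlation found by mining needs a theoretical account before it can be treated as meaningful.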

An additional challenge when it comes to data profiling pertains to the large
number of unreliable datasets available in cyberspace. In this regard, Desouza and Smith
(2014:42) point to an ominous practice by the American Petroleum Institute in 2011 which 
involved manipulating sentiment via Twitter to such a degree that they were able to create 
the impression that vast numbers of farmers, environmentalists, and landowners supported 
a pipeline project between Alberta in Canada and Texas in the USA. The Rainforest Action 
Network (RAN) became suspicious when they discovered an unusual spike in pro-pipeline 
tweets. They were ultimately able to prove that many of these messages were generated via 
automated Twitter bots.²⁶

On the topic of Twitter, scholars need to be cautioned about using social media 
platforms in general to collect and analyse data because these platforms entail a series of 
troubling methodological concerns. Below are a number of caveats issued by boyd and 
Crawford (2012:669) when it comes to mining data from Twitter. Their warnings could just 
as easily apply to other social media platforms. 


1. Twitter does not represent the global population. 

2. Some Twitter accounts are generated by social bots that people may be convinced are 
authentic internet personas. 

3. Twitter accounts may reflect either active users or passive participants — individuals who 
actively post tweets versus those who are simply “listeners” (Crawford 2009:525). 

4. An individual who tweets may hold multiple Twitter accounts, while the reverse may 
also be true: an account could be used by multiple individuals.

5. Data from Twitter may be skewed in the sense that its gatekeepers are able to filter out 
posts deemed to be uncivil or racist, for example. 

6. Even more interestingly, people have ‘fire-hose’, ‘garden-hose’, or ‘spritzer’ access to 
Twitter. Twitter Incorporated’s so-called fire-hose supposedly contains all public tweets 
that have been posted to date, while garden-hose access reflects approximately ten 
percent of all public tweets. A spritzer contains a mere one percent of these tweets. It
may surprise (qualitative) scholars unfamiliar with the nature of big data to know that 
very few researchers actually have access to the fire-hose. 
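
The practical effect of the sixth caveat can be illustrated with a small simulation. The sketch below assumes simple random sampling, which is an assumption made for illustration and not a description of how Twitter actually constructs its samples. The "fire-hose" is an invented list of 100 000 tweets, of which 0.5 percent carry a rare hashtag, and the function names are hypothetical.

```python
import random


def draw_sample(tweets, fraction, seed=0):
    """Simple random sample of a given fraction (an assumed sampling scheme)."""
    rng = random.Random(seed)
    k = round(len(tweets) * fraction)
    return rng.sample(tweets, k)


def share_with_tag(tweets, tag):
    """Proportion of tweets that contain the given hashtag."""
    return sum(tag in t for t in tweets) / len(tweets) if tweets else 0.0


# Invented 'fire-hose': 100 000 tweets, 500 of which carry a rare hashtag.
firehose = ["#rare topic"] * 500 + ["everyday chatter"] * 99_500
garden_hose = draw_sample(firehose, 0.10)  # roughly ten percent
spritzer = draw_sample(firehose, 0.01)     # roughly one percent

for name, data in [("fire-hose", firehose),
                   ("garden-hose", garden_hose),
                   ("spritzer", spritzer)]:
    n_rare = sum("#rare" in t for t in data)
    share = share_with_tag(data, "#rare")
    print(f"{name:12s} tweets={len(data):6d} rare={n_rare:4d} share={share:.4f}")
```

With only a handful of rare tweets expected in the one percent sample, estimates for small groups or uncommon topics become unstable, so scholars working from restricted access should report which tier their data came from.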


26 These bots autonomously perform tweeting and re-tweeting; they follow tweeters and control their
accounts via the Twitter API. API is an abbreviation for application programming interface; it is a
software intermediary enabling two applications to communicate with each other.

At the heart of the matter, then, is that determining data context does not simply
hinge on exploiting what technology has to offer in terms of web scraping, text mining, data 
integration, pattern recognition, and the like. Instead, big data analysis demands special 
domain knowledge and expertise, not to mention the proper skills to analyse the given 
dataset accurately (boyd & Crawford 2012). 

When it comes to research design and data analysis, both digital and social media 
researchers have not been swift to take advantage of what some scholars refer to as the 
computational turn or “data gold rush” (Kennedy, Moss, Birchall & Moshonas 2015:172). 
In fact, after conducting a meta-analysis of big data social media research by communication 
studies scholars, Mylynn Felt (2016) discovered that very few of them opted to exploit big 
data analytics. Felt (2016:13) speculates that low usage of data analytics may be due to 
the methods being so different from traditional social science research methods. Yet data 
analytics can certainly complement more traditional methods. Bogdan Batrinca and Philip 
Treleaven (2015) provide, amongst other things, a useful taxonomy of social media analytics 
tools for analysis of large datasets. This taxonomy includes scientific programming tools, 
business toolkits, social media monitoring tools, text analysis tools, and data visualisation 
tools. In subsequent chapters, we review these tools (as well as the relevant techniques and 
platforms associated with them), placing emphasis not only on the methodologies that 
underlie them, but also on critiquing these tools, the majority of which are unfortunately 
“commercial, expensive and difficult for academics to obtain full access [to]” (Batrinca & 
Treleaven 2015:92). 

Now which data analytics tool to select depends largely on the epistemological and 
theoretical underpinnings of any given scholar’s research (Felt 2016), and in this regard 
we turn to Rob Kitchin (2014b) for several lessons regarding emerging epistemological 
positions in the context of the social sciences and humanities. 


3.2.5 Lesson 5 


Big Data and new data analytics are disruptive innovations which are reconfiguring 
in many instances how research is conducted. — Rob Kitchin (2014b:1) 


Consider developing a reflexive epistemology in the context of big data projects. 


Earlier in this chapter, we cautioned scholars to also think qualitatively about big data. 
Thinking qualitatively involves, amongst other things, a careful re-consideration of — and 
critical reflection on — “the epistemological implications of the unfolding data revolution” 
(Kitchin 2014b:1). Kitchin (2014b) observes that there are currently two paths in the natural, 
life, engineering, and physical sciences, paths that reflect entirely dissimilar epistemologies, 
namely, empiricism and data-driven science, which we touched upon a little earlier on. While 
empiricism involves collecting data and then letting that data speak for itself (sans theory), 
data-driven science adheres to the principles of the scientific method. Presently, many
scholars advocate the need for data-driven science, predicting that it “will become the new 
paradigm of scientific method in an age of big data because the epistemology favoured is 
suited to extracting additional, valuable insights that traditional, ‘knowledge-driven science’ 
would fail to generate” (Kitchin 2017:32). 

Significantly, and as we noted in the previous chapter, big data and new analytics 
will probably not result in entirely new paradigms for researchers in either (computational) 
social science or the (digital) humanities, “given the diversity of their philosophical 
underpinnings” (Kitchin 2014b:12). Kitchin (2014b:9) makes the following observations 
about computational social science and the digital humanities in this regard: 


Whereas most digital humanists recognize the value of close readings, and 
stress how distant readings complement them by providing depth and 
contextualization, positivistic forms of social science are oppositional to 
post-positivist approaches. The difference between the humanities and 
social sciences in this respect is because the statistics used in the digital 
humanities are largely descriptive — identifying and plotting patterns. In 
contrast, the computational social sciences employ the scientific method, 
complementing descriptive statistics with inferential statistics that seek to 
identify associations and causality. In other words, they are underpinned 
by an epistemology wherein the aim is to produce sophisticated statistical 
models that explain, simulate and predict human life. 


Although neither big data empiricism nor data-driven science appears to be
making significant inroads when it comes to the epistemologies of these two disciplines
(Kitchin 2014b:7), big data nevertheless “presents a number of opportunities for social
scientists and humanities scholars” (Kitchin 2014b:10) who need to view big data far more
critically than commercial analysts do (Mahrt & Scharkow 2013:25). When critically
reflecting on the kinds of epistemologies that could be combined with big data, we take a
leaf out of the book of Mahrt and Scharkow (2013:30), who suggest that scholars should
focus on combining data-driven and theory-driven operationalisation strategies, but with a
greater emphasis on the latter. Coupled to this recommendation is a proposal related to 
research design as well as to the methodological training that social scientists and humanists 
should receive: 


In general, we should resist the temptation to let the opportunities and 
constraints of an application or platform determine the research question; 
the latter should be based on relevant and interesting issues — regardless of 
whether something is available through an API platform or seems easily
manageable with a given analytical tool. Methodological training ... 
should not only focus on computational issues and data management, but 
also continue to stress the importance of methodological rigor and careful 
research design. This includes a strong need for theoretical reflection, 
in clear contrast to the alleged ‘end of theory’ ...” (Mahrt & Scharkow
2013:30). 


Both digital humanities and computational social science have their detractors. The 
former is frequently criticised for its “weak, surface analysis ... [and its] overly reductionist 
and crude ... techniques” (Kitchin 2014b:8). The latter is seen “as being mechanistic, 
atomizing, and parochial, reducing diverse individuals and complex, multidimensional 
social structures to mere data points” (Kitchin 2014b:8). 


3.2.6 Lesson 6 


Thick Data can rescue Big Data from the context-loss that comes with the processes of 
making it usable. — Tricia Wang (2013:4) 


... there has been a lack of attention paid to the ethical obligation of transparent 
and complete reporting of studies using large-scale ... datasets. 


— Stuart Nicholls, Sinéad Langan and Eric Benchimol (2016:339) 


Big (social) data needs thick description and transparent reporting. 


Social scientists and humanists usually think about making small datasets “thick” (Latzko- 
Toth, Bonneau & Milette 2017:199), but according to Tricia Wang (2013:9), who describes 
herself as a global tech ethnographer, using thick description “is a great opportunity for 
qualitative researchers to position [their] work in the context of [big data’s] quantitative
results” as well. This view is echoed by a number of social scientists such as Felt (2016:14), 
who argues that computational analysis (combined with qualitative methods) can benefit 
from thick description because it enables scholars to see “both the big picture and the close, 
critical view”. This in turn enhances transparency, which “mandates that researchers have 
an ethical obligation to facilitate the evaluation of their evidence-based knowledge claims” 
(Moravcsik 2014:665). Graham (2017:449) points out that “while the scientist’s methods
can be paraphrased without any loss, in the humanities, the description itself is understood
to be part of the method”. In this respect, thick description is crucial; it forms part of the
researcher’s thinking and sense-making process.

Tellingly, a review of the literature over the last decade signals a paucity of research
on the connection between thick description and big data analysis. Most studies of this 
research concept continue to be tied to qualitative work and to small datasets. Even more 
significant is the fact that very few big data studies even mention transparency obliquely, 
although to their credit, Lazer et al. (2014:1205) explicitly refer to lack of transparency as
one of the most critical mistakes made by the Google Flu Trends team referred to a little 
earlier. Both transparency and thick description are critical if we want to achieve ethical 
research and this brings us to the next important big data lesson. 


3.2.7 Lesson 7


But it’s already public, right? — Emily Wolfinger (2016:1)


Accessible online data does not always translate into ethical practice. 


Perhaps a rather unpalatable truth to have to digest is that although vast quantities of data 
are now available online, sound ethical practice cannot be disregarded by anyone carrying 
out big data research. The assumption persists that online data is public data and that ethics 
is therefore not an issue. Yet we cannot ignore what constitutes a ‘public’ or a ‘private’ space 
in Internet-based research (IBR). A literature review of IBR ethics supports the necessity 
of obtaining informed consent in private domains and waiving it in public domains 
(Convery & Cox 2012:51), but the private-public space distinction is a contentious one as 
borne out by questions such as the following: If an online participant regards his/her posting 
as private, but communicates in a public space such as Twitter or Instagram, is the space 
deemed to be private or public? Can a domain be semi-private or semi-public? The literature 
provides us with some partial — and in some cases, ambiguous — answers which are related to 
the issue of informed consent in IBR: 


1. “Public space is defined as the space that applies no restriction to interaction and 
communication, whereas isolated space (private space) is one that completely constrains 
communication (Georgiou, 2006)” (Jang & Callingham 2012:76).


2. “A study with archival data and pre-existing resources publicly available might waive 
obtaining informed consent if the study does not include individual information and 
sensitive topics” (Jang & Callingham 2012:76). 


3. “One position is that data posted in open spaces without password or membership 
restrictions would usually be considered in the public domain and so available for 
research use without the need for informed consent from individual contributors ...” 
(Beninger, Fry, Jago, Lepps, Nass & Silvester 2014:6). 


4. Currently, there is no consensus as to whether or not scholars should obtain informed
consent if they wish to conduct Internet-based research (Gao & Tao 2016:185). 


To complicate matters, some researchers working in the field of IBR contend that 
we cannot ignore what an online participant perceives to be a private or public domain. 
A study that focused on chat-room participants, for example, found that these participants 
perceived the chat-room space to be private (Hudson & Bruckman 2004). In a similar study, 
researchers discovered that bloggers may regard their blogs as constituting private diaries, 
and may therefore not be in favour of public scrutiny of their thoughts (Gao & Tao 2016). 
We feel so strongly about the importance of abiding by ethical guidelines in big data research 
that ethics is the focus of the next chapter. 


Big Data needs big ethics


[Big data technology] is inherently ethics-agnostic. — Mandy Chessell (2014:1) 


The above statement appears to be provocative, but Chessell (2014:1) goes on to qualify it 
by arguing that researchers themselves need to be more thoughtful when it comes to how 
they use the technologies at their disposal. In this chapter, we interrogate the ethical issues 
behind big data research, beginning with big data blunders that underscore the urgent need 
for a sound code of ethical practices. Thereafter, we consider some of the major challenges 
of big data ethics that pertain to the use of human subjects, the dilemma inherent in the 
public-private space debate, the need for informed consent and anonymisation of data, the
problem of representativeness,²⁷ and the ethical drawbacks to the new digital ecosystem.


4.1 Big data faux pas of the millennium 


Outrage has a way of shaping ethical boundaries. — Sarah Zhang (2016:1) 


Big data is not simply about major epistemological adaptations, but also about significant 
changes at the level of ethics (Ambrose 2015:214). Yet, the world has seen quite a number of 
embarrassing and serious gaffes in terms of big data ethics in the last decade, one of the biggest 
being the 2016 release by Danish researchers of 70000 OKCupid profiles, complete with 
information relating to users’ genders, personality traits, and sexual preferences (Kirkegaard 
& Bjerrekaer 2016). What should concern big data scholars is that in an attempt to pre- 
empt any criticism, the researchers inserted a disclaimer to the effect that “[s]ome may object 
to the ethics of gathering and releasing this data. However, all the data found in the dataset 
are or were already publicly available, so releasing this dataset merely presents it in a more 
useful form” (Kirkegaard & Bjerrekaer 2016:2). What is more, a quick scrutiny of the tweets 
that followed the publication of the study shows that the researchers did not immediately 
respond to users’ concerns about their lack of ethical consideration for OKCupid users.²⁸
The only response that was forthcoming came from the lead researcher, Emil Kirkegaard,
who tweeted, “No. Data is already public”.²⁹ Following public outrage, OKCupid filed a
Digital Millennium Copyright Act (DMCA) complaint and the OKCupid dataset was subsequently
removed by the Open Science Framework (Resnick 2016).


27 See Chapter 3 as well.
28 https://twitter.com/KirkegaardEmil/status/730449904909324289.
29 Ibid.

The “it’s-in-the-public-domain” argument is not new according to privacy and
Internet ethics scholar, Michael Zimmer (2010). In fact, Zimmer (2010) exposed an ethics 
scandal in 2008 when Harvard sociologists Lewis, Kaufman, Gonzalez, Wimmer and 
Christakis (2008), in an attempt to track the evolution of friendships over time, published 
“Tastes, ties and time” which comprised an ostensibly anonymous dataset of 1700 college 
students. Unfortunately, the database could without too much effort be traced back to 
Harvard’s college class of 2009. The “already public” disclaimer was also employed by a 
former Apple engineer, Pete Warden, who, in attempting to create his own search engine, 
exploited a flaw in Facebook’s architecture to collect 215 million Facebook users’ accounts 
in 2010. Shortly after this mining exercise, Warden announced that he would make this 
database public for purposes of academic research (cf. Zimmer 2016:2). He deleted the 
entire database following Facebook's decision to take legal action against him (cf. Zimmer 
2016:4). 

Clearly, Zimmer’s (2010) earlier warnings about ethics have not been heeded 
by all; as early as 2010, and in the context of the Harvard Facebook scandal, Zimmer 
(2010:323-324) wrote: 


The ... research project might very well be ushering in “a new way of doing 
social science,” but it is our responsibility as scholars to ensure our research 
methods and processes remain rooted in long-standing ethical practices. 
Concerns over consent, privacy and anonymity do not disappear simply 
because subjects participate in online social networks; rather, they become 
even more important. 


Zimmer (2010:323-324) is quick to point out that the aim of his study was not to 
condemn or disparage the Harvard analysts, but rather to present their research as a case 
study of the ethical challenges inherent in big data research. 

Perhaps the worst ethics scandal to rain on big data research to date is reflected in a 
study conducted in 2012 and published in Proceedings of the National Academy of Sciences by 
Kramer, Guillory and Hancock (2014), in which 700000 Facebook users’ news feeds were subtly 
manipulated to create either negative or positive messages. Kramer et al. (2014:8788) concluded that 
“[e]motional states can be transferred to others via emotional contagion, leading people 
to experience the same emotions without their awareness”. What is highly problematic is 
that Facebook users were not informed that their news feed algorithms would be tweaked, 
resulting in major outcries from these users after the study was made public. The most that 
Adam Kramer would say on Facebook was that “[in] hindsight, the research benefits of the 
paper may not have justified all of this anxiety”.³⁰ 

One of the oddest big data studies to showcase ethical dilemmas is by Michelle 
Hauge and her colleagues published in the Journal of Spatial Science in 2016. Using a 
statistical inference technique known as geographical profiling, which is essentially employed 
to find individuals suspected of serious crimes such as rape, murder, and terrorism, Hauge, 
Stevenson, Rossmo and Le Comber (2016) attempted to track down a well-known 
pseudonymous graffiti artist based in the United Kingdom, Banksy, who prefers to stay 
entirely out of the public eye. The researchers even mined electoral rolls and other public 
websites to trace not only the artist’s former residential addresses, but also those of his wife. 
As Metcalf and Crawford (2016:2) point out, 


[t]here are many questions that could be asked of this study, not least about 
the correlation between graffiti and terrorism. But for our purposes, we 
will only focus on the ‘ethical note’ that appeared at the end of the article: 
“the authors are aware of, and respectful of, the privacy of [subject name 
removed] and his relatives and have thus only used data in the public 
domain” (Hauge et al. 2016: 5). This claim is particularly striking, as it is 
difficult to see how tracking a specific individual (and their family) to such 
an invasive degree could be considered respectful of their privacy. 


Metcalf and Crawford (2016:2) go on to argue that this research offers a case study 
for the invasiveness of public data and for the mistaken belief among big data analysts that 
data’s existence in the public domain means that ethical clearance is never necessary. 


4.2 A review of the literature 


4.2.1 Current challenges 


When applying ethical inquiry to any new method or practice, it can easily shift into 
ethical relativism, where no fixed principles universally apply to any situation that 
may arise. — James Willis, John Campbell and Matthew Pistilli (2013:7) 


Eventually, ethicists will have to continue to discuss ... how we can prevent the 
abuse of Big Data as a new found source of information and power. 
— Andrej Zwitter (2014:5) 


30 https://web.facebook.com/akramer?_rdc=1&_rdr. 


A plethora of studies have been published over the last few years that detail researchers’ 
concerns about some big data analysts’ disregard for ethical considerations.³¹ Legal scholars 
Richards and King (2014:395), for example, argue that 


[w]e are building a new digital society, and the values we build or fail to 
build into our new digital structures will define us. Critically, if we fail to 
balance the human values that we care about, like privacy, confidentiality, 
transparency, identity, and free choice, with the compelling uses of big data, 
our big data society risks abandoning these values for the sake of innovation 
and expediency. 


Other researchers who echo these sentiments include Davis (2012), boyd and 
Crawford (2012), Lyon (2014), and Metcalf and Crawford (2016). The latter two scholars 
identify two emerging problems related to ethics and big data research. The first 
has to do with the ever-increasing divide between research ethics in traditional disciplines 
and the methodologies underlying big data: “[big] data research methods exacerbate a long- 
standing tension between the social sciences and research regulations that are geared to the 
methods and harms of biomedical research” (Metcalf & Crawford 2016:1). In traditional 
disciplines, we have become used to exercising moral agency — protecting human subjects 
so as not to cause unjustified harm (Zwitter 2014:2). The second problem, which we have 
already touched upon, is one that Metcalf and Crawford (2016) paint as a North American 
dilemma: US research policy regards studies that exploit digital data as posing minimal 
risks to human subjects because the data is in the public domain. In South Africa, the scenario 
is not quite the same, but it is nevertheless worrying that most websites dedicated to big data 
ethics focus exclusively on research for commercial purposes or government projects, while 
a search of South African universities’ stance(s) on big data ethics in the humanities/social 
sciences has signalled that ethics review boards have no formal guidelines on conducting 
big data research: “there are very few examples of how institutions respond to the ethical 
challenges and issues [in big data research]” (Prinsloo & Rowe 2015:61). South Africa’s 
Protection of Personal Information (PoPI) Act (Act No. 4 of 2013) is not at all useful, 
making no mention of big data whatsoever.³² Indeed, South African attorney Jared Nickig 
(2017:1) argues that “PoPI ... is somewhat lacking in the nuance and sophistication needed 
to tackle the type of issues that might arise in the digital world”. On a positive note, South 
African scholars have begun to address ethical concerns about the use of student data in the 
digital age. Prinsloo and Rowe (2015), for example, have called on scholars to ensure that 
universities’ use of student data is at all times legal, ethical and fair (Prinsloo & Rowe 
2015:59). Drawing on international trends in ethical concerns and taking the South African 
higher learning context into account, these scholars consider a number of best practices that 
researchers should adhere to in the field of learning analytics. These best practices include 
taking into account the benefits and unintended consequences of using big data, accepting 
that collecting, analysing and using student data is “a moral practice and duty” (Prinsloo & 
Rowe 2015:60), appreciating that “[learning] analytics should be student-centric” (Prinsloo 
& Rowe 2015:60), and adhering to transparent collection, analysis and utilisation of student 
data. Several other papers have explored the ethics of learning analytics in South Africa, 
including those by Jordaan and Van Der Merwe (2015), Lemmens and Henn (2016), and 
Prinsloo and Slade (2017).³³ 


31 A Google Scholar search of “big data ethics” from 2010 to 2018 yielded only 438000 results. 
32 https://www.saica.co.za/Portals/0/Technical/LegalAndGovernance/37544_pro25.pdf. 


4.2.2 The controversy surrounding human subjects 


Where are human subjects in big data research? 
— Jacob Metcalf and Kate Crawford (2016:1) 


In an attempt to try to understand why we find it difficult to align big data with ethical 
research, it may be helpful to consider that we are in all likelihood accustomed to thinking 
about ethics in terms of protecting individuals from physical harm or data discrimination, 
for example, and that big data tends to divorce us from these concerns: “[it creates] an 
abstract relationship between researchers and subjects, where work is being done at a distant 
remove from the communities most concerned, and where consent often amounts to an 
unread terms of service or a vague policy” (Metcalf & Crawford 2016:2). A major dilemma 
in this regard is that the disciplines that have preceded data science, such as computer 
science, statistics, and mathematics, have not regarded themselves as having studied human 
subjects (Metcalf & Crawford 2016:2). 

33 Prinsloo and Slade’s (2016) paper considers both South African and British case studies of 
learning analytics. 


Social scientists and humanists who make use of big data should be mindful that 
human subjects generate the data they are interested in (Rosenberg 2010:28), and therein 
lies one of the intricacies reflected in the ethics of digital research: scholars have to decide 
if what they are researching is person-based or text-based (McKee & Porter 2009:5), and if 
the former is at play, how human subjects’ rights will be protected. In addition, researchers 
need to determine whether informed consent should be sought or whether it can be waived. 
In the social sciences and humanities, what constitutes a human subject is debatable and it 
seems that the concept of ‘human subject’ is not a particularly useful one, given the diverse 
and oftentimes confusing definition of this concept in digital research (cf. Markham & 
Buchanan 2012:6). According to the Stanford Encyclopedia of Philosophy, a human subject 
may be defined as “a living individual about whom an investigator ... conducting research 
obtains 1. data through intervention or interaction with the individual, or 2. identifiable 
private information”.³⁴ Some ethics scholars such as Lipinski (2009:58) argue that if there 
is no direct interference or interaction with the human subjects under investigation, then 
the subjects are not regarded as human subjects and “informed consent ... is not relevant”. In terms 
of Lipinski’s (2009) framework, if a researcher collects data in the form of users’ comments 
posted below a YouTube clip in order to analyse the linguistic features of the comments, 
and he/she does not interact with those users, then they are not human subjects. As Rourke, 
Anderson, Garrison and Archer (2001:13) put it, “a researcher analyzing transcripts of a 
[computer] conference, without participating in the conference, has not intervened in the 
process and thus has not placed them in the position of research participants”. This stance 
is supported by other researchers who have examined ethical issues in the context of web 
research (e.g., Madge 2007; Rosenberg 2010; Whiteman 2010; Thelwall 2010). 


4.2.3 The public-private space conundrum 


... the online status of public and private is ambiguous and contested. 
— David Berry (2004:326) 


At this point, we would like to state that this does not mean that informed consent is never 
required when digital research is carried out. Here, it is necessary to take into account what 
constitutes a public or a private space, and this too is often a debatable issue as we have 
already observed. Since it is not possible to view the private-public space in terms of a 
straightforward dichotomy, a good practice is to perceive it along a continuum in order to 
assess any risks to the human subjects under investigation. We believe that the following 
scale (Table 4.1) developed by Jang and Callingham (2012) is a useful one for researchers to 
employ when it comes to assessing ethics variation and risk: 


Table 4.1: Measuring ethics variation in digital research (Jang & Callingham 2012:75) 

| Risk level | Researcher’s acts | Research field and sub-contexts | Participants’ acts | Data content |
| LOW | Auto-participant (being a participant) | Public (open to anyone, no registration required to access information) | Overt without identifiable information (public) | Published without sensitive topic |
| | Participant observer (being one of the participants) | Semi-public (open to most people, but registration is required for participation) | Overt with identifiable information (public) | Published with sensitive topic |
| | Open observer (participants know they are being observed) | Semi-private (registration is always required to access information, but people can sign in or up) | Covert without identifiable information (private) | Unpublished/hidden without sensitive topic |
| HIGH | Passive observer (lurking) | Private (registration is always required and only people with permission can get access) | Covert with identifiable information (private) | Unpublished/hidden with sensitive topic |


34 http://plato.stanford.edu/entries/ethics-internet-research/#HumSubRes. 


If a researcher opts to collect commenters’ postings to Mail & Guardian Online, 
an online news site, then in terms of the scale in Table 4.1, the site constitutes a semi- 
public space for a number of reasons. First, Mail & Guardian Online is open to anyone, but 
readers who wish to discuss an article are required to first sign in with Facebook, Twitter, 
Google, Disqus or Thought Leader. Second, their comments can be read by anyone who 
visits the news site.³⁵ Clearly then, commenters’ acts are overt with identifiable information. 
Third, comments made may be sensitive in nature if they are made in relation to a contentious 
topic discussed in a given online article. According to some scholars such as Convery and 
Cox (2012:54) and Warrell and Jacobson (2014:28), it is only research into private and 
semi-private domains (such as closed chat rooms or email correspondence) that requires 
informed consent from the participants. Discussing research by Sveningsson Elm (2009), 
Rosenberg (2010:33) argues that either public or semi-public spaces can be studied without 
obtaining informed consent. 


35 Another space open to readers is eNCA’s online comments section. 


4.2.4 The culture of informed consent on the Internet and anonymisation 


... it is increasingly argued that consent distorts results, that consent is prohibitively 
burdensome for many studies; and that perhaps, in the era of “big data”, consent can 
be removed to facilitate major discoveries ... — John Ioannidis (2013:40) 


In addition to the challenge of trying to define private versus public spaces, big data 
researchers who explore aspects of social media also encounter potential problems when 
it comes to the culture of informed consent on social media platforms. In her study of 
Facebook and consent, Anja Bechmann (2014:22) questions the legality of the agreement 
between this site and its users, expressing doubt that it constitutes informed consent — that 
users actually read and understand its privacy policies, and that they are therefore aware that 
what they post online could be employed by third-party stakeholders. Her view is certainly 
borne out by a number of researchers, notably Acquisti and Gross (2006) and Govani and 
Pashley (2005), who have found, amongst other things, that the majority of users have 
never read Facebook’s privacy policies or terms of service and that they have a limited 
understanding of the information contained in these documents. Bechmann (2014:33) 
states that this behaviour points to a non-informed consent culture because “[in] a legal 
sense, one could argue that ‘informed consent’ does not take place”. 

A user’s acceptance of Facebook’s privacy policies does not necessarily mean 
that most researchers are exempt from asking that user for his/her informed consent to 
participate in a particular study. However, this is easier said than done even in a small-scale 
study. This is because it is not simply a matter of asking individual users for their consent: 
“many secondary persons (e.g., friends of the Facebook participants, and conversation 
partners of Twitter profile holders) are involved” (Lomborg & Bechmann 2014:21) and so 
metadata is generated. It would not be practical to ask all users to provide their informed 
consent. Lomborg and Bechmann (2014:21) advise researchers to obtain consent from 
primary users and “[to] assess possible privacy problems on behalf of [secondary] users and 
only use the data as documentation to the extent that other measures of privacy protection, 
such as anonymization, are put to use”. When it comes to big data analysis of social media 
sites, it is not feasible to seek informed consent; instead, “the legal and ethical challenges ... 
revolve around how data is anonymized both to the researcher and when presenting results” 
(Lomborg & Bechmann 2014:21). Strategies to protect users include scrubbing the data by 
removing identifying names and protecting the archives collected and stored by means of 
encryption. Storage safeguards and de-identification of individuals are strategies that are in 
line with South Africa’s Protection of Personal Information Act 4 of 2013. 
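
By way of illustration only, the scrubbing step described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a procedure taken from Lomborg and Bechmann (2014) or from the Act: the post structure (a dictionary with "author" and "text" fields), the function names pseudonymise and scrub_post, and the secret key are all hypothetical. The sketch replaces usernames with keyed hashes so that posts by the same person remain linked without the name being stored; it does not cover encryption of the stored archive, and it does not remove names that appear in free text without an @ sign.

```python
import hashlib
import hmac
import re

# Hypothetical secret key (an assumption for this sketch); in practice it
# would be generated randomly and stored separately from the research data.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

# Matches @mentions such as "@Sipho99" inside the text of a post.
MENTION = re.compile(r"@\w+")


def pseudonymise(username: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for a username using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, username.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]


def scrub_post(post: dict, key: bytes = SECRET_KEY) -> dict:
    """Replace the author name and any @mentions in the text with pseudonyms."""
    text = MENTION.sub(
        lambda m: "@" + pseudonymise(m.group()[1:], key), post["text"]
    )
    return {"author": pseudonymise(post["author"], key), "text": text}


if __name__ == "__main__":
    # Invented example post; the names are not drawn from any real dataset.
    example = {"author": "Thandi_M", "text": "I agree with @Sipho99 on this."}
    print(scrub_post(example))
```

A keyed hash is used here rather than a plain hash because usernames are easy to guess and re-hash; without the key, a pseudonym cannot simply be recomputed from a candidate name. Scrubbing of this kind removes direct identifiers only.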

Yet, even anonymised data can be de-anonymised, as we saw in the case where 
Michael Zimmer was able to trace Harvard sociologists’ dataset to a specific group of college 
students. We question whether it is at all practical to try to anonymise data in the era of big data 
when it is so easy to trace identities via the semantic web (cf. Dawson 2014) even when a 
user’s direct quotes have not been published. Dawson (2014) recommends that in light of 
this reality, it may be best to explicitly advise research participants of the risks involved. 
He adds that institutional review boards generally acknowledge that it is not entirely possible 
to preserve the anonymity of users in online spaces. 


4.2.5 Ethics and the problem of representativeness 


Big data continues to present blind spots and problems of representativeness, precisely 
because it cannot account for those who participate in the social world in ways that 
do not register as digital signals. — Kate Crawford, Kate Miltner and Mary Gray 
(2014:1667) 


We briefly touched on another ethical issue in the previous chapter when we discussed the 
fact that data gleaned from social media platforms such as Twitter may be unreliable when 
it comes to representativeness. Some of the concerns we raised are echoed by Kari Steen- 
Johansen and Bernard Enjolras (2015:127): 


Twitter studies have become very popular internationally, especially due 
to the availability of data. However, questions may be raised as to what 
analyses of Twitter posts represent. An obvious challenge is that Twitter 
users only constitute a certain selection of the population. Other issues 
are linked to the fact that there is no one-to-one-relationship between user 
accounts and actual people. One person can have several accounts, several 
people can use the same account, and accounts can also be automated — so-called 
‘bots’. 


Coupled to these problems is the phenomenon of lurkers on Twitter — users who 
read other people’s tweets without posting any messages, a phenomenon which has also been 
observed on Facebook (Brandtzæg 2012). A possible solution to this complication “might be 
to argue that the use of [big data] makes it possible to analyze social media sites like Twitter 
or Facebook on an aggregate rather than an individual level, and in this way paint a picture 
of these social media as public spheres based on whatever topics are being discussed and 
distributed” (Steen-Johansen & Enjolras 2015:128). 


4.2.6 A new digital ecosystem 


The current ecosystem around Big Data creates a new kind of digital divide: the Big 
Data rich and the Big Data poor. — Danah boyd and Kate Crawford (2012:674) 


Another critical ethical issue pertains to the fact that the computational turn has led to the 
creation of what boyd and Crawford (2012:674) call a big data ecosystem in which we can 
now distinguish between the big data haves and have-nots. In this respect, it is possible to 
identify two important elements that have contributed to this divide. The first relates to the 
reality that data has been privatised; it is ‘owned’ by companies such as Facebook, Google, 
and Twitter, which makes data access for academic researchers quite difficult (Steen-Johansen 
& Enjolras 2015:131). What is more, researchers who work for a particular company enjoy 
full access to all available data and are not obligated to obtain informed consent from users 
who would have had to accept specific terms and conditions in order to use the company’s 
services in the first place (Steen-Johansen & Enjolras 2015:131). The second element is 
an interesting one that has to do with the fact that everyone can now conduct research, 
given the growing disconnect between research activity and traditional research institutions 
(Steen-Johansen & Enjolras 2015:131). An accompanying and worrying trend is that the 
kind of big data research being carried out is not necessarily based on appropriate theoretical 
foundations, while data quality and representativeness are being thrown out the window 
(Steen-Johansen & Enjolras 2015:131). 

What “[big data] collectors, [big data] utilizers, and [big data] generators” (Zwitter 
2014:3) are currently grappling with in the big data ecosystem is datafication (van Dijck 
2014). In the first chapter, we introduced readers to the notion of datafication, the process 
whereby social action is transformed into huge amounts of computerised data to be studied 
(Mayer-Schönberger & Cukier 2013). Interestingly, van Dijck observes that in the evolution 
of big data, datafication has become normalised: “[datafication] as a legitimate means to 
access, understand and monitor people’s behavior is becoming a leading principle, not just 
amongst the technoadepts, but also amongst scholars who see datafication as a revolutionary 
research opportunity to investigate human conduct” (van Dijck 2014:198). In terms of 
ontology and epistemology, big data proponents claim that datafication is a neutral and thus 
innocuous paradigm for understanding social behaviour (van Dijck 2014:198) particularly 
in the context of predictive analytics, which uses techniques such as data mining, artificial 
intelligence, and statistics to make predictions about the future or about unknown events, 
thereby also generating new personal information and knowledge about individuals. 
This picture is insidious for a number of reasons, not least of which is the fact that this 
information has become “a commodity that is sold and traded among information empires 
and data brokers” (Mai 2016:192); big data companies have control over who may and 
may not access the information they generate (cf. boyd & Crawford 2012). Further, the 
ordinary citizen “does not have the power ... to control the flow of information into and 
among information empires and data brokers” (Mai 2016:197). An additional problem may 
be couched in the form of a question: Can we trust predictive analytics? On one level, flawed 
methodologies in big data analysis may be harmful if they result in inaccurate predictions. 
On another level, what Kate Crawford and Jason Schultz (2014:94) refer to as “predictive 
privacy harms” may also occur: 


... [In] 2012, a well-publicized New York Times article revealed that the 
retail chain Target had used data mining techniques to predict which 
female customers were pregnant, even if they had not yet announced it 
publicly. This activity resulted in the unauthorized disclosure of personal 
information to marketers. In essence, Target’s predictive analytics “guessed” 
that a customer was pregnant and disclosed her name to their marketing 
department, manufacturing [personally identifiable information] about 
her instead of collecting it directly. Although the customers likely knew 
that Target collected data on their individual purchases, it is doubtful that 
many considered the risk that Target would use data analytics to create 
such personal customer models to send advertising material to homes 
(Crawford & Schultz 2014:94). 


Finally, in addition to privacy concerns, predictive analytics could lead to probability 
harms. Dennis Hirsch (2014:351) points out that according to Mayer-Schönberger and 
Cukier (2013), algorithms could forecast the likelihood of an individual suffering a heart 
attack, reneging on a home loan or committing a crime, for example. In this way, the 
individual’s chances of obtaining an insurance policy, buying a home or securing a job could 
be compromised.³⁶ 


4.3 Data justice 


...an idea of data justice — fairness in the way people are made visible, represented 
and treated as a result of their production of digital data — is necessary to determine 
ethical paths through a datafying world. — Linnet Taylor (2017:1) 


Professor of data ethics Linnet Taylor (2017:1) urges scholars to support the notion of data 
justice, that is, treating the participants under investigation fairly and transparently, by constructing 
three pillars she refers to as visibility, engagement with technology, and non-discrimination. 
The first pillar relates to privacy and representation, which Taylor (2017:9) argues should be 
accompanied by visibility and respect for informational privacy. Furthermore, this pillar calls 
for awareness of the risks that might occur if, through collective profiling, group privacy is 
not protected (cf. Floridi 2016; Raymond 2016). The second pillar has to do with rejecting 
the notion of being treated as subalterns: individuals should enjoy the freedom to choose 
not to use specific technologies and to determine to what degree they would like to be visible 
to data markets. The final pillar, non-discrimination, is made up of “the power to 
identify and challenge bias in data use, and the freedom not to be discriminated against” 
(Taylor 2017:9). 


36 Having said this, even in data mining the target variable is based on probability, and the data miner 
needs huge amounts of high-quality data to perform accurate data mining/prediction. This means 
that it is not that easy for an individual to be compromised. 

The ethical considerations discussed here merely scrape the surface, and we will 
return to these considerations as well as to the idea of data justice in Chapter 6 when we 
assess how “data power” (Kennedy & Hill 2017a:769) affects not only the ever-increasing 
divide between scholars who collect data and individuals targeted by data collection, but also 
data visualisation, which is the focus of the next chapter. An entire chapter is devoted to data 
visualisations because they “may ... incorporate conscious or unconscious bias in those who 
have prepared them” (Fuller 2017:93). 


Does big data visualisation make our 
endeavours less humanistic? 


...visuals allow for simple and powerful communication of data, while also serving 
as a tool for research development. — Malu Gatto (2015: 9) 


It will come as no surprise to practitioners of data science that for many humanists, 
visualisation in the era of big data remains a foreign element. Although visualisation is a 
useful tool for illustrating aggregate data in a comprehensible way,³⁷ questions that scholars 
in the fields of literary studies and English studies, for example, are currently asking, point 
to a level of mistrust of data visualisation techniques. These questions include the following: 


How do literary keywords such as style, influence and genre demand 
rethinking in the context of quantitative analysis? Do we necessarily 
abandon a theoretical commitment to foregrounding the material basis 
of texts when we use digital methods of “distant reading” to understand 
textual transmission? ... how can visual tools and practices rework, rather 
than refigure, verbal information? (Graham 2017:449). 


What may at first appear to be discouraging is that data visualisation and the 
humanities are indeed on the face of it irreconcilable for the simple reason that scholars in 
the humanities disciplines generally subordinate objective measurements to interpretation 
(Bradley, Mehta, Hancock & Collins 2016:1). Yet, many researchers in the humanities have 
in recent years taken up the challenge of marrying the humanities and visualisation in ways 
that are both innovative and productive. Before we consider their suggestions, we need to 
take into account the fact that if a particular visualisation is to be effective, in the sense of 
enabling individuals to not only comprehend it, but also interact with it, then its designers 
need to consider both the cognitive and perceptual processes and limitations of the human 
mind (Olshannikova, Ometov, Koucheryavy & Olsson 2015:9). On a cognitive level, poorly 
designed visual displays may lead, amongst other things, to ambiguous interpretation of 
data (Burkhard & Eppler 2005), cryptic encoding (Tufte 1986), the obscuring of important 
insights (Few 2006; Kosslyn 2006), over-complication in how information is represented 
(Few 2006), and the absence of adherence to Gestalt principles (Tufte 1986). These and 
many other consequences of poor design are discussed in the section below to highlight what 
Kennedy and Hill (2017a:769) have referred to as “the pleasure and pain” of visualisation. 


37 It should be noted that data visualisations may be visual and (occasionally) auditory. 


5.1 Data visualisation, human cognition, and human perception 


Visualisation is not unique to the computer science domain. 
— Maureen Stone (2009:44) 


With data ever-increasing in quantity and becoming integrated into our daily lives, 
having effective visualizations is necessary. — Michelle Borkin (2014:iii) 


There is no question that visual representations of highly intricate information and 
knowledge demand higher-level cognitive functions which include, but are not limited 
to, recall, reasoning, understanding, and insight (Patterson, Blaha, Grinstein, Liggett, 
Kaveney, Sheldon, Havig & Moore 2014:42). Patterson et al. (2014:44) theorise that these 
functions are enabled through top-down cognitive processes, a theoretical perspective which 
challenges the traditional notion that the connection between perception and cognition is 
relatively simple, involving “bottom-up engagement of low-level, feature-detection processes 
that sequentially feed into higher-order cognition” (Patterson et al. 2014:44). While these 
researchers foreground top-down processing in their human cognition framework, they 
nevertheless argue that bottom-up and top-down processes interact in a dynamic way: the 
former type of processing essentially guides the way in which the latter type is processed, 
resulting in the activation of so-called “organized knowledge structures in long-term 
memory” (Patterson et al. 2014:44). This interplay has been observed by other researchers, 
notably, Grill-Spector and Kanwisher (2005), Oliva and Torralba (2006), and Biederman 
(1981), who have carried out tests to determine how fast and/or accurately individuals are 
able to detect meaning from visual stimulation. 

Building on dual-process theory, Patterson et al. (2014:44) go on to propose that 
human cognition involves interaction between dual systems — an analytical or reasoning 
system and an autonomous or intuitive system. The analytical system allows for pattern 
recognition, analogical reasoning as well as deductive reasoning, while the intuitive system 
by contrast is responsible only for pattern recognition. Both systems rely on top-down and 
bottom-up processing as well as encoding, during which a new visual image is converted 
into a neural representation in a person’s short-term memory (Patterson et al. 2014:45). 
What is important for scholars designing big data visualisations to consider is the fact that 
it is only information extracted during encoding that can be utilised for any processes that 
follow. Attention, a process during which the brain filters and selects specific information, 
and working memory, the ability to retain that information, interact during encoding. 
For visualisations to be effective, they need to be designed in such a way that there are no 
attention distractions present, particularly when the individual viewing the visualisation is 
performing another task at the same time (Patterson et al. 2014:45). One of the consequences 
of attention interference is that memory retrieval may be impeded (Dudukovic, Dubrow & 
Wagner 2009:953), which is problematic given that long-term memory, which interacts 
with working memory and pattern recognition, is ultimately required to store neural 
representations of information (Patterson et al. 2014:46). Attention interference may also 
impact negatively on the decisions an individual makes about a data visualisation (cf. Craik 
2014:4). 

Given these complex cognitive processes, Patterson et al. (2014:47) recommend that 
data visualisations be designed with specific leverage points or strategies in mind, the first of 
which is intended to achieve bottom-up, stimulus-driven attention: a visualisation should 
reflect “salient cues to drive exogenous attention, alerting users to changes in or important 
attributes of [the] visualisation”. What this essentially means is that the visualisation should 
be designed so that it draws an individual’s attention to a stimulus; if this does not happen 
then the result is “inattentional blindness” (Simons 2000:147) — the individual’s failure to 
consciously register specific attributes of the visualisation because his/her attention has been 
directed elsewhere. To overcome inattentional blindness, Patterson et al. (2014:48) suggest 
that the researcher incorporate striking design elements or cues into a visualisation that the 
individual cannot easily ignore. One such cue is colour, and a review of the literature provides 
several interesting findings about why and how specific colours should be incorporated 
into data visualisations (cf. Lin, Fortuno, Kulkarni, Stone & Heer 2013). Silva, Santos 
and Madeira (2011:326) argue that colour carries specific cultural connotations and that 
being mindful of this fact helps reduce a viewer’s cognitive load. For instance (and stating 
the rather obvious), when it comes to the visualisation of average maximum temperatures 
in Cape Town over six months, viewers will be drawn to Figure 5.1 below because South 
Africans associate shades of red and orange with high temperatures and blue shades with 
low temperatures. The choice of these colours helps focus the viewer’s attention (cf. Stone 
2009:48). Figure 5.2 illustrates the same data, but this time, the colours are reflected in 
the adjectives ‘hot’, ‘warm’, ‘mild’, ‘cool’, and ‘cold’. However, these colour choices are 
semantically incongruent when they should be semantically resonant; different shades of 
blue, for example, would not be evocative of hot or warm temperatures for most people and 
would therefore cause strong attention distraction (Lin et al. 2013:2). In cognitive science, 
contrasting colour cues result in interference, and this is referred to as the Stroop effect 
(see Figure 5.3), named after J. Ridley Stroop (1935) who discovered that processing one 
stimulus feature will hamper the simultaneous processing of another stimulus feature (cf. 
MacLeod 1991:163). 


[Figure: chart of monthly values, January to June] 
Figure 5.1: Cape Town’s average maximum temperatures in degrees Celsius 


Month              January   February   March   April   May    June 
Temperature (°C)   27        27         25      21      18     16 
Label              Hot       Hot        Warm    Mild    Cool   Cold 
Figure 5.2: Semantically incongruent colour choices to represent temperatures 


[Figure: the colour words YELLOW, WHITE, BLACK, GOLD, GREY, PURPLE, ORANGE, SILVER, 
PINK, GREEN, RED, and BROWN, with the instruction “Say the colour of the word, not the word.”] 
Figure 5.3: The Stroop effect 
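
By way of illustration, the idea of semantically resonant colour can be sketched in a few lines of Python. The sketch below is not taken from Lin et al. (2013) or from any other study cited here; it assumes the six monthly values shown in Figures 5.1 and 5.2, and the two anchor colours (`COLD_RGB` and `HOT_RGB`) as well as the function names are hypothetical choices made purely for the example. Each temperature is mapped onto a blend between a blue and a red so that lower values look “cooler” and higher values look “warmer”.

```python
# Monthly values as they appear in Figures 5.1 and 5.2.
MONTHS = ["January", "February", "March", "April", "May", "June"]
TEMPS_C = [27, 27, 25, 21, 18, 16]

# Assumed anchor colours (hypothetical choices for this sketch).
COLD_RGB = (33, 102, 172)   # a blue, assumed to evoke low temperatures
HOT_RGB = (178, 24, 43)     # a red, assumed to evoke high temperatures


def resonant_colour(temp, lo, hi):
    """Blend linearly from COLD_RGB (at lo) to HOT_RGB (at hi); return '#rrggbb'."""
    t = 0.0 if hi == lo else (temp - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values that fall outside [lo, hi]
    rgb = [round(c + t * (h - c)) for c, h in zip(COLD_RGB, HOT_RGB)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def colour_scale(temps):
    """Return one hex colour per temperature, scaled to the data's own range."""
    lo, hi = min(temps), max(temps)
    return [resonant_colour(x, lo, hi) for x in temps]


if __name__ == "__main__":
    for month, temp, colour in zip(MONTHS, TEMPS_C, colour_scale(TEMPS_C)):
        print(f"{month:<9} {temp:>3}  {colour}")
```

A straight blend in RGB space is a deliberately crude choice: intermediate values pass through purplish tones, and a designer would probably prefer an established diverging or sequential palette. The point of the sketch is only that the mapping from value to colour is an explicit design decision that can be made congruent, or incongruent, with viewers’ expectations.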


The second leverage point identified by Patterson et al. (2014:47) is geared 
towards achieving top-down, voluntary attention, and so data visualisation needs to reflect 
“appropriate organization of material or interaction options to assist endogenous attention 
and minimize distracting information” (Patterson et al. 2014:48). To focus attention as well 
as reduce any distractions so that information can be encoded and then stored in working 
memory, Patterson et al. (2014:48) recommend the use of what they call “endogenous 
attentional resources” such as clear labels or arrows that guide the viewer towards the 
relevant information as well as the elimination of extraneous information. As far as the 
latter is concerned, a number of studies have explored how visual clutter impacts a viewer’s 
cognitive processing (Bravo & Farid 2004; Doolittle, McNeill, Terry & Sheer 2005; Ellis & 
Dix 2007; Rosenholtz, Li & Nakano 2007): “if too much visual information is presented 
... then the visual channel’s capacity will be exceeded, leading to insufficient processing of 
that visual information” (Doolittle et al. 2005:198). Techniques for reducing clutter include 
limiting data (Sula 2012), removing uninformative or redundant detail, and experimenting 
with different layouts which also affect cognitive processing (Hornof 2004). 
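
One of the clutter-reduction techniques mentioned above, limiting data, can be sketched as follows. This is a minimal illustration rather than a method drawn from Sula (2012) or the other studies cited: the function name `limit_categories`, its parameters, and the sample counts are all invented for the example. The sketch keeps only the largest categories and pools the remainder under a single label, so that a chart built from the result has fewer marks competing for the viewer’s attention.

```python
from collections import Counter

# Invented counts, used only to demonstrate the function.
SAMPLE_COUNTS = {"A": 40, "B": 25, "C": 20, "D": 3, "E": 2, "F": 1}


def limit_categories(counts, keep=5, other_label="Other"):
    """Keep the `keep` largest categories and pool the rest under one label."""
    ranked = Counter(counts).most_common()  # sorted by count, largest first
    kept = dict(ranked[:keep])
    remainder = sum(value for _, value in ranked[keep:])
    if remainder:
        kept[other_label] = kept.get(other_label, 0) + remainder
    return kept


if __name__ == "__main__":
    print(limit_categories(SAMPLE_COUNTS, keep=3))
```

Pooling is itself an interpretative choice: whatever is folded into the pooled label disappears from view, so the threshold should be reported alongside the visualisation.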

The third strategy to bear in mind to facilitate cognitive processing of a visualisation 
has to do with chunking, which entails “[choosing] visualisation parameters that provide 
strong grouping cues ... which will minimize the effects of working-memory capacity 
limitations” (Patterson et al. 2014:48). Well-known chunking techniques are the use of 
common image parameters in the form of specific colours or shapes, for example (Patterson 
et al. 2014:48), as well as adherence to the principles of Gestalt theory (Wertheimer 1938), 
which proposes that as human beings, we tend to organise visual information into groups or 
patterns in an attempt to comprehend the picture as a whole (Quinn & Bhatt 2015:691). 
More digestible visual representations that follow Gestalt principles are those that exploit 
similarity, proximity, and enclosure.³⁸ Similarity is a principle according to which viewers of a 
visualisation will group visual attributes that they perceive as possessing similar characteristics 
related to colour, texture, and size, to name a few (cf. Kobourov, Mchedlidze & Vonessen 
2015). Thus, Figure 5.4 will automatically be perceived as having two elements — green and 
blue rectangles. The Gestalt law of proximity states that objects will be grouped together if 
they have been arranged close to one another (Figure 5.5), and this applies even if the objects 
reflect diverse characteristics (although Guberman (2015:28) points out that the objects still 
need to be similar in some sense). Enclosure/closure states that we are predisposed “to close 
up objects that are not complete” (Olshannikova et al. 2015:17). In this respect, the World 
Wide Fund for Nature’s iconic panda symbol is a good example. Data visualisations may 
also utilise the principle of closure by surrounding specific groups with visual elements. In 
Figure 5.6, viewers can easily identify two distinct groups based on the fact that the designer 
has enclosed related elements in visually dissimilar boxes. 


38 Other Gestalt principles include continuity/continuation (where the visualisation enables the eye to 
be drawn from one object to another) and connectedness (where there are clear connections between 
various objects as typically seen in genealogical charts). 


Figure 5.4: Gestalt principle of similarity 


Figure 5.5: Gestalt principle of proximity 


Figure 5.6: Gestalt principle of enclosure 


Organising information based on mental models (structural analogies of the world) is 
another tactic that may be useful to consider in visualisations because this kind of information 
supposedly activates “strong retrieval cues for knowledge structures in long-term memory to 
aid reasoning” (Patterson et al. 2014:49). In the literature, theories that attempt to explain 
how human beings reason abound, and include the theory of the meaning of conditionals 
(Johnson-Laird & Byrne 2002), or what Patterson et al. (2014:49) refer to as a mental model 
of “imagined possibilities” according to which inferences are generated; the probabilistic 
approach to human reasoning (Oaksford, Chater & Larkin 2000); and mental logic theory 
(O’Brien 2009). Although these theories are controversial,³⁹ many scholars concede that 
mental models play a significant role in reasoning (Johnson-Laird & Byrne 2002; 
Johnson-Laird 2010; Johnson-Laird & Khemlani 2015): 


Human reasoning is not simple, neat, and impeccable. It is not akin to a 
proof in logic. Instead, it draws no clear distinction between deduction, 
induction, and abduction, because it tends to exploit what we know. 
Reasoning is more a simulation of the world fleshed out with all our 
relevant knowledge than a formal manipulation of the logical skeletons of 
sentences. We build mental models, which represent distinct possibilities, 
or that unfold in time in a kinematic sequence, and we base our conclusions 
on them (Johnson-Laird 2010:18249). 


39 For a critique of the mental model theory proposed by Johnson-Laird and Byrne (2002), see Evans, 
Over and Handley (2005), and for a critique of Oaksford, Chater and Larkin’s (2000) conditional 
probability model, refer to Schroyens and Schaeken (2003). López-Astorga (2016) provides a sound 
critical appraisal of the mental logic theory. 


Several researchers have exploited what we know about mental models and reasoning 
to design more comprehensible visualisations. We have already touched on Lin et al. 
(2013), who have experimented with semantically resonant colours they believe will evoke 
specific associations in the minds of individuals because they are grounded in well-known 
linguistic and cultural conventions. In an interesting study, another team of researchers 
used photographs, clipart pictures, and icons as visual embellishments, and found that 
they facilitated both memorisation and concept comprehension (Borgo, Abdul-Rahman, 
Mohamed, Grant, Reppa, Floridi & Chen 2012). 

The fifth leverage point recommended by Patterson et al. (2014:51) has to do with 
creating visual displays that reflect analogous patterns related to viewers’ mental models with 
a view to facilitating analogical reasoning. Analogical reasoning involves using the attributes 
of a source domain that is well understood to facilitate the understanding of a target 
domain that is not so well understood (Figure 5.7). Risch (2008:4) argues that analogical 
visualisations, which are not to be confused with metaphorical graphics or visualisations,⁴⁰ 
reflect specific characteristics, the most important being “[to] express systematic relations among 
the elements of a target domain in terms of those of some source domain”. In addition, 
the elements of the source and target domains must be aligned (Risch 2008:4). Thus, for 
example, geographic maps and pictures are graphical analogues: the internal structure 
of these visualisations very closely resembles that of the phenomena they represent, and 
therefore allows viewers to bridge the gap between the known and the unknown. 


40 In metaphorical graphics, abstract concepts are expressed in such a way that the target domain is 
semantically distant from the source domain (Risch 2008:4). 


Source domain (the known)                 Target domain (the unknown) 
e.g., water flowing through pipes         e.g., electric circuit 

Known elements                            Inferred elements 
pipes are like:                           wires in an electric circuit 
the pump is like:                         the battery in an electric circuit 

Figure 5.7: Example of analogical reasoning (adapted from Daugherty & Mentzer 2008:10) 


The five recommendations just outlined are useful for scholars whether their 
research is informed by quantitative or qualitative methods, but humanists require even 
clearer direction since, as noted at the beginning of this chapter, the humanities and data 
visualisation appear to be ill-matched. Let us return now to what the literature tells us about 
the ways in which humanists may exploit data visualisations to provide insights into their 
qualitative findings. 


5.2 Technology and the humanities 


If we believe that technology determines our choices and that we are therefore 
not responsible but act only in accordance with a technological imperative, we let 
technology decide for us in the sense that we renounce responsibility for our actions. 
— Bjorn Hofmann (2006:2) 


What adds to the divide between humanists and big data scholars is not only the mistaken 
assumption that data visualisation is the sole domain of quantitative research, but also the fact 
that the visual display of qualitative data does not have a particularly long tradition. In the last 
decade, researchers who have explored how to visually express qualitative information 
include Slone (2009), Verdinelli and Scagnoli (2013), Henderson and Siegal (2013), and 
Chandler, Anstey and Ross (2015). What humanists may find frustrating about these studies 
is that although they provide practical advice on displaying qualitative data through flow 
charts (Draucker & Martsolf 2008), ladders (Eriksson, Starrin & Janson 2008), matrices 
(LeGreco & Tracy 2009), networks (Cheyney 2009), and the like, some researchers have 
begun challenging the emphasis placed on how data should be displayed when “it must be 
understood that data visualization for the humanities needs to be built more for experience 
than demonstration of fact” (Bradley et al. 2016:3). In Drucker’s (2014:125) view, graphical 
tools introduced to the humanities and then embraced by scholars “are a kind of intellectual 
Trojan horse, a vehicle through which assumptions about what constitutes information 
swarm with potent force”. 

For Bradley et al. (2016:2), a most useful starting point for humanists is to consider 
Martin Heidegger’s (1966, 1977) conception of technology since it provides an analogy 
for understanding the interaction between texts and technology.⁴¹ Looking through a 
phenomenological lens, Heidegger argues that technology should not be reduced to its 
instrumental value and that its value depends very much on its contexts of use⁴² (cf. Ihde 
2010:152). In the humanities, computing’s instrumental purpose lies in its ability to process 
data as quickly as possible, but this does not constitute interpretation: “what is necessary is 
an understanding that the expressed purpose of technology itself (its instrumental existence) 
and our engagement with it (its anthropological potential) are two separate parts of a 
technological whole” (Bradley et al. 2016:2). 

In terms of this understanding, Fish (2012:1) contends that one of the flaws inherent 
in many digital humanities projects is the tendency to “first run the numbers, and then ... see 
if they prompt an interpretative hypothesis. The method, if it can be called that, is dictated 
by the capability of the tool” (cf. Bradley et al. 2016:2). To overcome this deficiency, Bradley 
et al. (2016:3) recommend an approach according to which humanists adopt what they call 
“slow analytics”, which takes into account both the time and space required to cognitively 
process information. 

To test this approach, these researchers recruited PhD students and academics who 
teach and/or publish on literary criticism or poetics. These participants were asked to analyse 
two poems using Livescribe Anoto (digital) pens⁴³ and paper so that the researchers could 
determine the processes involved in each participant’s analysis. Annotation of the poems 
(which were selected on the basis of being unknown to each participant to avoid expertise 
bias) was voluntary, and if a participant opted to annotate a poem, they were also asked 
to explain how their annotations functioned during a self-reflective discussion with the 
researchers.⁴⁴ Not unexpectedly, what the researchers found was that literary analysis is a 
painstakingly slow, methodical, and iterative process, in which reflection is essential, helping 
generate meaning-making and insights about the texts under investigation (cf. Srivastava 
& Hopwood 2009:76). In a subsequent phase, the researchers designed what they call a 
“metatation” interface or system, generating visualisations of participants’ analyses only after 
their sense-making phase. Thus, for instance, if one participant underlined specific words 
in a poem to make sense of synonyms and antonyms, the metatation system augmented the 
analysis by adding further synonyms and antonyms, in this way drawing the participant’s 
attention to words he/she had missed. This prototype interface allows scholars to interact 
with texts in their own time and space, and because it is introduced only after the analyses 
have been carried out, it does not impede the thinking and sense-making process in any way, 
thus respecting the performative nature of interpretation in the humanities. 


41 We acknowledge that Heidegger’s views on technology are not without their idiosyncrasies, and that 
they are sometimes obscure to say the least. In this respect, Godzinski (2005) provides an insightful 
overview of Heidegger’s controversial philosophy of technology. 


42 Of course, we concede that this argument should be moderated since “[t]echnologies reveal different 
things for different people” (Loukissas 2012:22). In addition, Heidegger himself did not view 
technology as something to be avoided, a view exemplified in his ‘Memorial address’: “For all of us, 
the arrangements, devices, and machinery of technology are to a greater or lesser extent indispensable. 
It would be foolish to attack technology blindly” (Heidegger 1966:53). 


43 Digital pens allow the user to record what they write, and all data generated can be transmitted to a 
digital device through wireless technology. 


5.3 Data as capta 


Capta is not data as we typically understand data. Capta represents what is seen, thought 
and felt. — Bryan Beverly (2017:2) 


In order to reconcile the humanities and visualisation, the latter needs to be re-conceptualised 
to preserve the integrity of humanistic data, and here, Drucker’s (2011:2) “polemic”, as she 
refers to it, involves strongly urging humanists to re-conceive of data as capta. This call has been 
made by many other scholars over the past few years (Clement 2012; Kitchin 2014; Maier & 
DeIuliis 2015; Enslen 2016; Furner 2016), one of the reasons being that the etymology of 
the term data in the context of the humanities is fairly problematic. ‘Data’, originally derived 
from the Latin word datum, literally means ‘something given’, which erroneously creates the 
impression that data is always a given (Owens 2011). In an earlier chapter, we mentioned the 
fact that the notion of (big) datasets as constituting raw information is rather troublesome, 
given that data does not just come into existence: capturing data and then interpreting it 
is based on specific choices on the part of the researcher that encompass judgement and 
discernment, amongst other things. Yet, in an increasingly datafied society, the realist view 
of knowledge continues to persist, presenting various phenomena as existing outside of the 
observer when, as correctly pointed out by Drucker (2011:1), “[r]endering observation (the 
act of creating a statistical, empirical, or subjective account or image) as if it were the same 
as the phenomena observed collapses the critical distance between the phenomenal world 
and its interpretation, undoing the basis of interpretation on which humanistic knowledge 
production is based” (cf. Ambrosio 2015:137; Kennedy, Hill, Aiello & Allen 2016:719). 


44 In the humanities, an annotation constitutes information employed to classify, code or comment on 
the data sources collected (Evers 2018:64). 

To generate a more nuanced understanding of data visualisation in the humanities, 
capta (literally meaning ‘taken’ in Latin) appears to be a far better term, since it captures the 
constructivist notion that knowledge is “taken” in the sense that it is partial, situated, and 
constitutive (Drucker 2011:2). Levi (2013:34) expresses it succinctly when she says that 
“[h]umanistic data are as un-data as they can get ... [because] [h]umanities data are not 
generated by instruments, but by people in the process of going about their everyday life”. 
Drucker (2011:2) proposes a number of fundamental principles to be followed when capta 
is constituted and displayed, going so far as to argue that if these principles are ignored, 
what will ultimately be compromised is the very authority of the knowledge generated by 
humanists: “[t]he digital humanities can no longer afford to take its tools and methods from 
disciplines whose fundamental assumptions are at odds with humanistic method”.⁴⁶ 

Drucker (2011:13) refers to John Snow’s well-known map of cholera outbreaks 
(discussed in Chapter 1) as an excellent example of how visuals may be (re-)designed to 
reflect co-dependent constructivism rather than observer-independent realism. Snow’s map 
as depicted in Figure 5.8 quite effectively illustrates the role of the pump in the number of 
people who died of cholera in London. In Drucker’s (2011:14) view, therefore, the map 
served its purpose. However, she goes on to wonder who the dots are, arguing that the 
addition of demographic features such as an individual’s age, health, and family role could 
have provided a more nuanced, complex statistical view of the epidemic. Drucker (2011:14) 
does not stop there, suggesting that the rate of deaths and their frequency could be integrated 
into the map on a temporal axis to reflect “increasing panic”. This is a fairly novel notion, 
namely, that “[t]he display of information about ... affective experience can easily use standard 
metrics” (Drucker 2011:7). The terrain could also be re-drawn from the perspective of, for 
instance, an individual who has lost a loved one, not only to illustrate the urban streetscape 
of nineteenth-century London, but also to highlight what Drucker (2011:14) describes as 
“[features] of the graphical representation of humanistic interpretation”.⁴⁷ The re-envisaged 
graphic is illustrated in Figure 5.9. 


45 This does not mean that humanists and scientists are engaged in some kind of intellectual battle. 
Drucker (2011:2) concedes that her data-capta distinction “is not a covert suggestion that [...] only 
the humanists have the insight that intellectual disciplines create the objects of their inquiry. Any self- 
conscious historian of science or clinical researcher in the natural or social sciences insists the same is 
true for their work”. 


46 We contend that the humanities, and not only the digital humanities, fall under this admonition. 


47 Drucker (2011:7) recognises that what she is suggesting are “subjective methods”, and risks the 
observation that “[r]ecognizing that such subjective methods are anathema to the empirically minded 
makes me even more convinced that they are essential for the generation of graphical displays of 
interpretative and interpreted information”. 


Figure 5.8: John Snow’s map of cholera outbreaks in London (drawn by Snow circa 1854, and taken 
from Stamp’s (1964) The geography of life and death) 


Figure 5.9: John Snow’s map re-envisaged (Drucker 2011:19, with credit to Xárene Eskandar for the graphic) 


An important tenet implicit in Drucker’s (2011) description, and one that highlights 
the distinction between data and capta, is that graphical displays in the humanities should 
always follow a humanistic approach which entails infusing the displays with affect through 
expressive graphics and metrics. This tenet is exemplified in the suggestions that John Snow’s 
map would be even more striking if it (1) reflected rates of death to show “increasing panic” 
(Drucker 2011:14) and (2) included features that reminded the viewer of what the streets of 
London looked like in the 1850s. What runs like a golden thread through Drucker’s (2011) 
suggestions is respect for the interpretative nature of knowledge represented in qualitative 
displays, which is another tenet she upholds. 

Although they are still in the minority, some scholars working with large volumes of 
data have also called for more attention to be paid to the emotional dimensions of graphical 
representations. Kennedy and Hill (2017b) recently conducted an empirical study with 
a view to illustrating how individuals engage on an emotional level with data and their 
visualisation. The conclusion reached was that visualisations should inform individuals’ 
minds and hearts — that “it is not only numbers but also the feeling of numbers that is 
important” (Kennedy & Hill 2017b:1). This argument is advanced by other researchers such 
as Norman (2004), Grosser (2014), and Kirk (2016), since emotions not only play a key role 
in our social and cultural experiences (cf. Kennedy & Hill 2017b), but also support reason 
and rational thinking (cf. Stratton 2012). Kennedy and Hill (2017b:3) speculate that the 
neglect of emotional dimensions could, at least in part, be due to the history of visualisation, 
which may be traced back to the Age of Enlightenment, a period preoccupied with reason 
rather than subjectivity, tradition or passion.⁴⁸ 


5.4 The emotional and social pitfalls of visualisation 


Data are not just numeric — they are both statistical and visual. In part because of this 
entanglement, data stir up emotions. — Helen Kennedy and Rosemary Hill (2017b:2) 


Up to this point, we have discussed the fact that data visualisations may have cognitive 
drawbacks if they are not designed with a human cognition framework in mind. Additional 
caveats to bear in mind, particularly in light of Drucker’s (2011) call for visual displays 
imbued with affect, pertain to what researchers may not be aware of, namely, the emotional 
and social risks of visualisation, risks which are often overlooked or dismissed in favour 
of the cognitive effects of visualisation (cf. Roos, Bart & Statler 2004:551). As far as the 
emotional dimension is concerned, and based on a multidisciplinary literature review of data 
visualisations, Bresciani and Eppler (2015:4) point out that visualisations may unwittingly 
cause viewers to experience negative or inappropriate feelings. In terms of emotion and 
encoding, their study⁴⁹ has revealed that viewers may find visualisations disturbing (Tufte 
1990; Cawthon & Moere 2007), uninteresting (Cawthon & Moere 2007), or unattractive 
(Cawthon & Moere 2007; cf. Krum 2013). As far as decoding is concerned, some 
visualisations (such as those that flicker or are striped) have been found to cause visual 
stress to the point of causing viewers to feel ill (Ware 2007). Viewers may also favour some 
visualisations rather than others based on their personal likes and dislikes of certain visual 
elements (Tversky 2005). Finally, viewers may experience negative emotions when decoding 
an image if they have encountered this image before and experienced it in a negative way 
(Chen 2005; Avgerinou & Pettersson 2011). 


48 This is certainly an over-simplification; van Holthoon (2017:8) quotes Hume, who claimed in 
A treatise of human nature (1739) that “[r]eason is and ought only to be the slave of passions”. 

When it comes to the social effects of encoding, the role of hierarchy and the exercise 
of power involved in the process of designing a visualisation cannot be dismissed. In this 
regard, visual representations may be manipulated in such a way that viewers have access 
to some information, but are barred from possessing knowledge about certain aspects of 
that information based on the designer’s choice to (un)intentionally include or exclude 
information and knowledge, a risk outlined in a study by Ewenstein and Whyte (2007). 
What we should never forget is that “data visualisations are not neutral windows onto data: 
they privilege certain viewpoints, perpetuate existing power relations and create new ones 
and, as such, they do ideological work” (Kennedy & Hill 2017a:773). Another area of 
concern pertains to data visualisations that are designed in such a way that a specific point 
of view is emphasised to such a degree that there is no room for individuals to generate 
alternative views or invent other options (Whyte, Ewenstein, Hales & Tidd 2007). Bresciani 
and Eppler (2015) have also reviewed the literature to identify problems in the area of 
decoding. In the context of the use of visualisations in group interaction, some researchers 
have recorded altered behaviour on the part of members of the group. For instance, Eppler 
and Platts (2009) have observed that in cases where a graphical representation is generated 
in a group, the opinions of the members of that group may be suppressed by an individual 
who is regarded as dominant. Another (perhaps unintended) consequence of badly designed 
visuals is that they may be misinterpreted in different cultural contexts, given that symbols 
and colours are not universal in their meanings, a consequence that is well documented 
in the literature (Nisbett 2003; Ewenstein & Whyte 2007; Avgerinou & Pettersson 2011; 
Bresciani 2014; Forsythe 2014; Jahns 2014). The recency effect is an additional drawback 
of information visualisation observed by Tufte (1986, 2006) and Nisbett (2003), amongst 
others. In terms of the recency effect, a viewer's interpretation of a graphical representation 
may be coloured by a recent experience or event. 


49 We have added additional studies to Bresciani and Eppler’s (2015) list. 


5.5 Visualisation and the problem of data power 


Is big data analytics good or evil? — Bill Franks (2015:1) 


If humanists thought that designing data visualisations was fraught with myriad challenges 
such as those just highlighted, then they have not yet considered the problems inherent 
in data power which can be exploited for good or ill: data displays present arguments and 
explanations put forward as objective facts, thus shaping the ways in which we perceive the 
world we live in (Boehnert 2016:1; cf. Kitchin 2014; Williamson 2017). Furthermore, and 
advancing critical approaches to visualisation, Boehnert (2016:18) observes that scholars 
are increasingly compelled to work within what she calls “the ideological scaffolding 
of the neoliberal political project”, which is becoming all too familiar to South African 
humanists and social scientists (Clare & Sivil 2014; Le Grange 2016). One of the negative 
consequences of this project is that it subjects human endeavours to “market principles” 
(Brown 2011:118), forcing academics to generate reams of research for the sake of the 
knowledge economy.⁵⁰ Under pressure to publish as much and as frequently as possible in 
a datafied world, academics are bowing to “the hegemony of metric power” (Feldman & 
Sandoval 2018:219),⁵¹ sacrificing the interpretative element of research typically undertaken 
by humanists and social scientists to data visualisation: “the desire for data visualisations can 
be understood as motivated by the need to operate within a market, as visualisations are 
seen as a means to ‘sell’ the research capabilities of university departments, as they market 
themselves to external organisations” (Kennedy & Hill 2017a:776). We re-iterate here that 
while data visualisations are an important element of the research process, they must not 
become mere “graphical primitives” (Manovich 2011:47) of the artefacts under investigation. 

In the next chapter, we take a closer look at the notion of data power in an era of big 
data, but this time in the context of studies carried out in the humanities and social sciences. 
Both disciplines face threats, “not least because ‘ethnography’ is often presented as the Other 
to big data” (Boellstorff 2013:2). Yet there are also many opportunities for humanists and 
social science scholars to showcase the kinds of contributions their disciplines can make to 
the big data phenomenon. 


50 Diane Powers Dirette (2016:2) expresses dismay when she observes that “[i]f statistical significance is 
found [in an article], the research article, regardless of the size of the data, is more likely to be published”. 

51 Kennedy and Hill (2017b:6) define metric power as “the growing prevalence of numbers, data and 
measurement in contemporary forms of government and control” (cf. Beer 2016b). 
Data power in the era of big data: 
Friend or foe? 


Datafication ... harbors both threats and opportunities for civic engagement. 
— Gutiérrez and Milan (2017:95) 


In her insightful book entitled Weapons of math destruction, mathematician and data scientist 
Cathy O’Neil (2016) offers a number of sobering thoughts on the dangers of big data power 
in the absence of social justice. Two choice comments she makes are that “Big Data has 
plenty of evangelists, but I’m not one of them” and “[while] Big Data, when managed 
wisely, can provide important insights, many of them will be disruptive”. With respect to 
the latter comment, O’Neil (2016) laments both the fact that data scientists tend to present 
the problems reflected in ecosystems without also providing solutions to them, and the fact 
that many big data researchers employ algorithms they do not share with the public: “we see only the 
results of the experiments researchers choose to publish” (O’Neil 2016:148). We cannot 
ignore the fact that big data has a dark side too. 


6.1 Big data’s shadow side 


We have to explicitly embed better values into our algorithms, creating Big Data 
models that follow our ethical lead. — Cathy O’Neil (2017:256) 


In a paper published more than a decade ago, Ivor Baatjies (2005:30) painted a somewhat 
stark picture of scholarship undertaken at institutions of higher learning in South Africa 
when he wrote that “[c]orporate mentality and ... neoliberal fatalism have no regard for any 
form of research in favour of social justice, oppression and exploitation”. It appears that the 
emergence of big data may only have made the situation worse. Scholars refer to the many 
instances of big data being used as a tool to threaten democracy and increase inequality 
(O’Neil 2017). Gangadharan (2012), for instance, warns how large-scale commercial data 
profiling in terms of race and ethnicity has excluded vulnerable individuals from receiving 
economic, social or political benefits in North America. Indeed, Mayer-Schönberger 
and Cukier (2013) argue that we may be heading for big-data authoritarianism in which 
Big Brother will create even greater asymmetry of power between the privileged and the 
oppressed.⁵² The idea that Big Brother is watching our movements and tracking our personal 
data has also been raised by the South African media. In 2017, investigative journalist Heidi 
Swart⁵³ wrote the following in Daily Maverick: 


South Africa’s intelligence community outsources the analysis of social 
media platforms to the private security sector. The social media posts 
that people choose to make public ... can be used by intelligence services 
to accurately assess the sentiments, thoughts, movements and plans of 
people, groups, or institutions. Such analysis, called data mining, produces 
SOCMINT - social media intelligence. 


In addition, the South African government currently uses geofencing technology⁵⁴ 
utilised by a software programme referred to as Media Sonar to monitor citizens’ social media 
activity, the aim being to identify potential threats to the country’s security (Swart 2018). 

South Africa experienced its biggest data leak in 2017 when an Australian information 
security expert, Troy Hunt, exposed the fact that 60 million South Africans’ personal data 
had been leaked. The data breach appears to have emanated from Jigsaw Holdings, a holding 
company for real estate franchises. According to Skolmen and Gerber (2015:4), South African 
organisations still have little understanding of the basic conditions of protecting personally 
identifiable information in terms of the country’s Protection of Personal Information (POPI) 
Act signed into law in 2013 (Burke & van Heerden 2017:85). 

Other examples of the abuse of big data power include the now infamous Snowden 
disclosures about North America’s National Security Agency and its surveillance practices 
(van Dijck 2014), the establishment of “spurious correlations”⁵⁵ (Calude & Longo 2017) in 
large datasets, the manipulation of consumers’ buying choices through robotic “nudging”⁵⁶ 


52 This is of course not a new notion: as Gangadharan (2012:1) aptly puts it, “old forms of prejudice 
and injustice can be grafted onto these new tools”. 


53 It is worth noting that Heidi Swart’s article was commissioned by the University of South Africa’s 
Department of Communication and the University of Johannesburg’s Department of Journalism 
(through its Media Policy and Democracy Project). 


54  Geofencing allows users to cordon off specific geographical locations using GPS technology (cf. 
Luxhoj 2016). 


55 Calude and Longo (2015:13) define a correlation as spurious “if it appears in a randomly 
generated database”. 


56 In Nudge: Improving decisions about health, wealth, and happiness, Thaler and Sunstein (2008) assert 
that people can be “nudged” into making better choices about various aspects of their lives. While some 
scholars have lauded nudging theory (Cohen 2013; Saghai 2013), others maintain that it is not as 
innocuous as it seems and that it is riddled with ethical dilemmas (Selinger & Whyte 2011; Leggett 
2014; Borenstein & Arkin 2016). 
(Helbing 2015), and the engineering of public opinion during election campaigns (Tufekci 
2014; Bessi & Ferrara 2016). 

In all probability the most ominous big data project we have recently encountered 
is China's so-called “social credit” programme, which the Communist Party claims will be 
fully operational by 2020 to monitor its 1.4 billion citizens. Referring to the creation of a 
digital dictatorship, ABC journalist Matthew Carney (2018) reports that China has already 
launched a pilot programme in which each citizen has been assigned a default social credit 
score of 800 points, which then changes depending on his/her behaviour, which is tracked 
via the country’s 200 million state-of-the-art CCTV cameras designed with facial recognition, 
body scanning, and geo-tracking capabilities.⁵⁷ In addition, the Chinese government intends 
collecting data about its citizens’ behaviour from their smartphones as well as from their 
medical, financial, and purchasing records. In trial areas, this Orwellian project has already 
resulted in the punishment of approximately 10 million people who have lost points owing 
to “anti-government” behaviour such as bad driving, smoking in non-smoking areas, and 
defamation of the ruling party. Punishment is meted out in the form of fines and through 
lack of access to job promotions, travel visas, and the like. Those with high scores, on 
the other hand, are rewarded by way of incentives which include faster and easier access 
to housing, jobs or job promotions, and good schools. In an insightful article entitled 
‘Engineering the public’, Zeynep Tufekci (2014:12) observes that “[s]tarting an empirically 
informed, critical discussion of data politics now may be the first important step in asserting 
our agency with respect to big data that is generated by us and about us, but is increasingly 
being used at us”. (We consider Tufekci’s (2014) call in Chapter 9 when we discuss the need 
for scholars to approach big data science through a critical data studies lens that challenges 
stealthy practices such as dataveillance and big data privacy breaches by interrogating the 
social, political, ethical and economic implications of big data projects.) 

Academics too are beginning to harness the power of big data in ways that do not 
necessarily put the well-being of individuals first (Lewis et al. 2008; Kramer et al. 2014; 
Kirkegaard & Bjerrekær 2016), an insidious practice we noted in Chapter 4. What appears 
to be exacerbating this trend is the pressure that some journals are placing on qualitative 
researchers to limit their discussion of empirical data and abbreviate any explanation of 
their data collection methods owing to these journals’ space constraints (cf. Chandler et al. 
2015:1; Messner, Moll & Strömsten 2017:440).⁵⁸ 


57 https://www.abc.net.au/news/2018-09-18/china-social-credit-a-model-citizen-in-a-digital-dictatorship/10200278 

58 These space limitations may also inadvertently encourage the creation of graphical representations 
that are condensed to such a degree that they become over-simplifications of complex information. 
Although this situation sounds anything but encouraging, big data nevertheless 
provides humanists and social scientists (who already have a long tradition of using their 
scholarship to study cultural, political, and social issues)⁵⁹ with new opportunities to foster 
humanistic and social scientific inquiry.⁶⁰ Some scholars may ask, Are humanists and social 
scientists who work with big data indeed creating awareness in the areas of humanity and society 
and/or advocating change in these areas? Below, we discuss just a few studies in the humanities 
and social sciences to give an indication of some of the fields that are beginning to make a 
contribution in the context of the big data movement. We focus predominantly on those 
studies in which the researchers have harnessed big data techniques, but without sacrificing 
traditional, qualitative methods. We are mindful that humanists and social scientists use 
diverse methodological approaches, and that our discussion does not do justice to covering 
the full range of what each discipline entails. Our purpose here is simply to give readers a 
taste of what humanists and social scientists who exploit big data are currently doing. 


6.2 Engaged humanists and social scientists 


Big Data processes codify the past. They do not invent the future. Doing that 
requires moral imagination, and that’s something only humans can provide. 
— Cathy O’Neil (2017:256) 


In the era of big data, humanities scholars and social scientists may be struggling 
between what they perceive to be only two choices: to continue analysing texts in a slow, 
methodical way without drawing on the vast amounts of information at their disposal, 
or to exploit big data methods that sacrifice interpretation to metrics (Gregory, Cooper, 
Hardie & Rayson 2015:151). However, it is not as simple as choosing one option over the 


59 These kinds of studies are diverse, ranging from the exploration of gender patterns in reading literacy 
(Zuze & Reddy 2014) to the examination of petrocultures (Szeman 2017) and food security (Pradhan 
& Rao 2018). 


60 According to Milan and Gutiérrez (2015:121), the era of big data has fuelled a type of activism 
referred to as data activism among citizens and ordinary people as opposed to only hackers and 
open-source activists. They distinguish between pro-active data activism and re-active data activism, 
although both conceive of information as “a constitutive force in society able to shape social reality” 
(Milan 2018:155). Re-active data activists resist digital censorship, control, and surveillance through 
technical practices such as encrypting personal communications, activating anonymous browsing, 
and blocking advertisements on websites (Milan & van der Velden 2016:57). Pro-active data activism 
is a little more difficult to conceptualise as it is a relatively new empirical phenomenon, but activists in 
this domain typically utilise tactics that “range from technology development projects and platforms 
for the manipulation of data and the visualization of data patterns for campaigning and advocacy” 
(Milan & Gutiérrez 2015:129). Strictly speaking, humanists and social scientists do not fit into either 
of these categories, but we contend that many regard themselves as activists if their work is geared 
towards social change (see Section 6.3 in this respect). 
other since scholars need to strike a fine balance between close reading, which cannot be 
divorced from qualitative approaches to research, and big data methodologies (Gregory et 
al. 2015:151). This recommendation is echoed by DeLyser and Sui (2012:3) who call on 
researchers to be informed by scholarship in the traditional humanities and social sciences 
to ensure that they not only resist “superficial number crunching” of data, but that their 
findings are also contextualised. This advice has been taken to heart by a number of scholars 
with fruitful results. 

In the spatial humanities, for instance, Porter, Atkinson and Gregory (2015) have 
successfully combined two disparate methodologies to analyse and map nineteenth- and 
early twentieth-century disease mortality patterns in infants recorded in official population 
reports for Britain and Ireland from 1850 to 1911. These paired methodologies are 
geographic information systems or GIS (used to capture, store, manage, analyse, and display 
spatial or geographic patterns) and corpus linguistics, traditionally employed by linguists to 
analyse vast volumes of digital texts. Porter et al. (2015:27) chose to supplement GIS with 
corpus linguistics because the latter enabled them to both quantitatively and qualitatively 
analyse a digital text containing two and a quarter million words. The innovative merging of 
the methodologies, which the researchers call Geographical Text Analysis (GTA), uncovered 
new insights into how the population reports were linked to changing mortality patterns in 
infants over a period of time. The researchers maintain that GTA could be used to make 
comparisons between the geographies mentioned in the texts and spatial distributions of 
specific events (Porter et al. 2015:33). GTA techniques could also be exploited to help 
researchers understand other historical and contemporary documents such as library archives, 
newspapers, diaries, photographs, and literary narratives (Porter et al. 2015:34). The kind of 
qualitative GIS employed by Porter et al. (2015) is growing rapidly and contributing to the 
big data movement in many fields such as sociology, history, and the digital humanities in 
light of the realisation that big qualitative data such as images and texts also need to be coded 
for computation (Pavlovskaya 2016:1). 

Big data has also partnered quite productively with the field of history and indeed, 
big history courses have become popular at universities in Australia, the Netherlands, and 
the United States (Spier 2014:171) because it “offers a fundamentally new understanding 
of the human past, which allows us to orient ourselves in time and space in a way no other 
form of history can match” (Spier 2014:172).⁶¹ In this regard, big history enables scholars 


61 An extensive search of the web indicates that South African historians have not yet embraced big 
history. We did find reference to big history on a website hosted by PASCAP, an after-school care 
organisation that aims to bring big history to children in the Western Cape. On the site, Chris 
Wheeler makes the following comment with specific reference to an integrated history of the cosmos, 
earth and humanity: “I can’t help but be inspired by the values and the spirit of wonder Big History 
embodies, as something that could help empower and galvanise a new scientifically literate generation 
to add their voices, histories, insights and knowledge to the greatest story ever told”. 
to see big patterns they would not ordinarily have seen had they chosen to explore only 
small procedures (Spier 2014:172). Spier (2014:179) speculates that it is this ability to see 
new patterns that has encouraged many big historians across the globe to collapse several 
approaches into a single approach, typically combining scientific views and historical accounts 
into a single large narrative accompanied by rigorous theory. Walter Alvarez’s (2016) book, 
A most improbable journey: A big history of our planet and ourselves, is a reminder that big data 
should never be staid or one-dimensional in nature and that they may generate enthusiasm 
for history among academics and students alike. Alvarez (2016) provides an ambitious big 
history of the universe, foregrounding geological history, but also incorporating cosmic 
history, an account of earth’s major geographical features, and even a discussion of our 
human bodies to generate a personal narrative about the planet’s history.⁶² Like other big 
data scholars, big historians also struggle to access large datasets from commercial providers 
when they “[find] themselves locked out of the digital archive by paywalls and search 
structures which are unhelpful for the kind of analysis that they wish to conduct” (Maxwell- 
Stewart 2016:360-361). Fortunately, Maxwell-Stewart (2016:362) signals that there are 
ways around this problem, and that the goal is not necessarily to attempt to capture vast 
amounts of information from a single source. What many big data scholars are doing instead 
is to extract information from multiple sources online, thus collecting large quantities of 
metadata on a particular topic (cf. Chapters 1 and 4 in this book). Historian Catherine Hall 
(2016), for instance, trawled through vast amounts of online data to find information on 
slave compensation pay-outs, settler colonialism in Australia, and the migration of imperial 
families to achieve a deeper understanding about the movement of human and financial 
assets into Australia between the 1830s and 1840s. Collecting data from multiple digital 
sources is one way in which scholars may overcome the lack of access to big data discussed 
in previous chapters. 

Although some political scientists view big data with a degree of scepticism, arguing 
that big data itself cannot help them achieve valid causal inferences and that theory building 
is a key aspect of their research (Grimmer 2015:81), others have produced both insightful 
and useful projects when they have combined big data with descriptive inferences, which 
in turn have helped them create new theories (Grimmer 2015:80). One such project is 
the VoteView project (Pool & Rosenthal 1997; McCarthy, Pool & Rosenthal 2006) in 
the United States that, among other things, makes use of NOMINATE scores to map US 
House and Senate representatives’ ideological positions. According to Grimmer (2015:81), 
an interesting finding to come out of NOMINATE is that ideological polarisation between 
Republicans and Democrats has increased significantly over the last four decades. Other 
political scientists have harnessed the power of big data to conduct US presidential 


62 See J. Daniel May’s (2017) review of the book in Journal of Big History. 



election forecasting (Linzer 2013) and conflict forecasting (Brandt, Freeman & Schrodt 
2011). Social and political scientists Colleoni, Rozza and Arvidsson (2014) have combined 
machine learning and social network analysis to predict whether Twitter users are Democrats 
or Republicans. One of their main findings is that Democrats who are not inclined to track 
Twitter's official accounts tend to demonstrate higher levels of homophily — the tendency to 
connect with similar individuals — while activists who do follow official accounts have lower 
levels of homophily. The researchers suggest that in order to better understand political 
homophily, Twitter users’ political and cultural practices need to be considered as well. The 
lesson here is that big data scholars doing social media research should not focus solely on 
the social media platform under investigation: “[such] a turn away from ... treating the 
Internet or SNS as a separate reality and towards a focus on the Internet as one among many 
aspects of social reality in general ... might open up interesting and fruitful avenues for big 
data analysis” (Colleoni et al. 2014:329). 
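
To make the notion of homophily concrete, the following minimal Python sketch computes, for each group, the share of follow ties that stay within that group. It is an illustration only and should not be read as the measure used by Colleoni et al. (2014); the user names, party labels, and the function name homophily_by_group are all hypothetical.

```python
from collections import defaultdict


def homophily_by_group(follows, party):
    """Return, per group, the share of outgoing ties that point to the same group."""
    same = defaultdict(int)   # ties that stay inside the follower's group
    total = defaultdict(int)  # all outgoing ties of the follower's group
    for follower, followed in follows:
        group = party[follower]
        total[group] += 1
        if party[followed] == group:
            same[group] += 1
    return {group: same[group] / total[group] for group in total}


# Hypothetical toy network: "D" and "R" are invented party labels.
party = {"ana": "D", "ben": "D", "cy": "D", "dee": "R", "eli": "R"}
follows = [
    ("ana", "ben"), ("ana", "cy"), ("ben", "ana"), ("cy", "dee"),
    ("dee", "eli"), ("eli", "dee"), ("eli", "ana"),
]

if __name__ == "__main__":
    for group, share in sorted(homophily_by_group(follows, party).items()):
        print(f"group {group}: {share:.2f} of ties stay within the group")
```

A value close to 1 would indicate that a group's members mostly follow one another, while a lower value would indicate more cross-group ties. A sketch of this kind captures only network structure, which is why the authors' point about also considering users' political and cultural practices matters.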

Digital data in education is another area in which big data has made significant 
inroads, particularly given the shift in recent years from traditional classrooms to blended 
and online learning which employs learning management systems such as Moodle, 
Blackboard, LearnUpon, and Fuse Universal (cf. Reyes 2015:75).⁶³ As is the case in areas 
of big data science and business intelligence, education has also undergone datafication.⁶⁴ 
As Selwyn (2015:66) puts it, “schools, colleges, universities and other educational contexts 
now function increasingly along ‘data driven’ lines”. In fact, digital data work in education 
has become normative and this is particularly evident in how learning analytics (which 
measures, collects, and analyses student data with a view to improving teaching and learning) 
has been embraced by educators (Greller & Drachsler 2012; Siemens 2012). We include a 
discussion of digital data in education here because sociologists have become increasingly 
concerned about some of the ethical facets of big digital data in education. One of these 
facets pertains to data inequality because control, social power, and inequality may be 
reinforced through processes driven by data (Selwyn 2015:71). Another cause for concern 
is related to the fact that digital data tends to reinforce what Selwyn (2015:71) refers to as 
an “increase in managerialism within education”. One bleak study has highlighted how 
school authorities, after being informed by data generated at a historically African-American 
school, proposed to close it down, ignoring the fact that the sensibilities of the surrounding 
community should also be taken into account in the decision-making process (Khalifa, 
Jennings, Briscoe, Oleszweski & Abdi 2014:148). Of deep concern is that digital data in 


63 We refer to educational studies here, since scholars in the humanities may regard this field as a 
humanistic endeavour. 

64 Data could include, among other things, naturally occurring data generated via a learning 
management system, data based on library use, assessment grades, and the like. 
education could be exploited to conduct “dataveillance” (Selwyn 2015:73). Unfortunately, 
dataveillance practices have been observed by a number of scholars including Land and 
Bayne (2005), Knox (2010), Rosenzweig (2012), and Taylor (2013). Sociologists remain 
perturbed that school authorities may be more interested in the operationality that digital 
data demonstrates than in the social meanings it generates. Selwyn (2015:79) calls on all 
interested stakeholders — academics, educators, parents, and pupils — to take a stand against 
“the ‘politics of [big] data’” in education, and not to take them at “face value”. 

Social scientists studying different aspects of social media have also begun exploring 
the benefits of complementing their traditional research methods with those from big data. 
An enlightening study in this respect is one by Williams and Burlap (2016) who combined 
their criminology and computer science skills to examine cyber hate on Twitter following 
the murder of Lee Rigby by Islamist extremists in 2013. What makes this study innovative 
is not only the fact that it reflects a new interdisciplinary methodology the researchers call 
“computational criminology” (Williams & Burlap 2016:217), but also that it is one of the 
first to exploit sophisticated big data techniques and analytics to analyse contemporary crimes 
such as cyber hate. Another informative study using big data analytics is one conducted by 
Innes, Roberts, Preece and Rogers (2016) aimed at analysing social media reactions to the 
murder of Lee Rigby. In contrast to Williams and Burlap (2016) who paid attention to 
quantified measures of cyber hate speech following the terrorist attack, Innes et al. (2016) 
focused on understanding the content of social media communications through detailed 
qualitative coding of users’ tweets. Reviewing Felt’s (2016) article on the intersection 
between social science and big data analytics in the context of social media research, it is 
clear she would in all likelihood favour the study by Innes and his colleagues rather than that 
by Williams and Burlap (2016) as she contends that only traditional, qualitative methods 
would “enable both the big picture and the close, critical view” of the data collected. 

Psychology is yet another discipline that could make a meaningful contribution to 
big data research, and like researchers in other disciplines, those conducting psychological 
research are of the view that psychologists should harness their own competencies rather 
than try to develop new computational skills. Two psychologists who support this position 
are Cheung and Jal (2016:4) who point out that most psychologists are sufficiently 
competent to make use of data analytics to analyse large datasets because they are trained 
in psychological theories, statistics, and psychometrics. These researchers propose that 
scholars make use of what they call the “SAM” or “split/analyse/meta-analyze approach” 
to test their theories on big datasets that are based on human behaviour. Since standard 
computers cannot accommodate huge amounts of data, the first step entails splitting the 
data into many datasets. In the analysing step, common statistical analyses are employed 
such as regression analysis and multilevel analysis. The third step entails carrying out a 
meta-analysis so that statistical inferences can be made (Cheung & Jal 2016:4). At the same 
time, Cheung and Jal (2016:10) argue that psychologists should “lower the threshold for 
engaging in big data research” because data analytics is complex. They also suggest that 
psychologists collaborate with researchers who have sophisticated computational skills, a 
suggestion that, as we have seen, has been made by many scholars since the advent of the big 
data phenomenon (Chapter 2). 
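
The three steps of the split/analyse/meta-analyse approach described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not Cheung and Jal's own implementation: the data are simulated, the analysis step is a simple linear regression, the pooling step is a fixed-effect (inverse-variance weighted) meta-analysis, and the names simulate, fit_slope, and sam are hypothetical.

```python
import math
import random


def simulate(n, seed=0):
    """Simulated data (an assumption for illustration): y = 2 + 0.5*x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_slope(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (slope, standard error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def sam(xs, ys, n_splits):
    """Split the data, analyse each part, then meta-analyse the slopes."""
    # Step 1 (split): deal the rows into n_splits roughly equal subsets.
    parts = [(xs[i::n_splits], ys[i::n_splits]) for i in range(n_splits)]
    # Step 2 (analyse): run the same regression on every subset.
    estimates = [fit_slope(px, py) for px, py in parts]
    # Step 3 (meta-analyse): pool the slopes, weighting each by 1 / SE^2.
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


if __name__ == "__main__":
    xs, ys = simulate(10_000)
    pooled, pooled_se = sam(xs, ys, n_splits=20)
    print(f"pooled slope = {pooled:.3f} (SE {pooled_se:.3f})")
```

In this sketch each subset is small enough to be analysed on its own, and the pooled estimate should lie close to the slope obtained from the full dataset. With real behavioural data a researcher might prefer a random-effects model in the third step if the subsets are expected to differ systematically.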

A recent study to emerge out of the digital humanities is one by Brown, Mendenhall, 
Black, van Moer, Lourentza, Flynn, McKee and Zevai (2016) who employed black feminist 
theory to make sense of large datasets. What makes this data study quite fascinating is that it 
simultaneously interrogates the biases in computational analysis and the digital humanities 
while exploiting both to recover and preserve black women’s narratives. Challenging “the 
embedded whiteness and maleness of computational analysis” and critiquing the digital 
humanities for failing to examine issues of identity and power, Brown et al. (2016) searched 
through approximately 800 000 documents — articles, books, and newspapers — archived in 
two digital libraries, JSTOR and the HathiTrust, in an attempt to identify black women’s 
perceptions about and lived experiences in the United States. In order to make sense of the 
large datasets, Brown et al. (2016) made use of MALLET for topic modelling, comparative 
text mining, and data visualisation. Topic modelling allowed the researchers to explore and 
interrogate different genres of text such as sociology or poetry in the large datasets, while 
comparative text mining helped them identify latent and common themes across all data 
collected. Isolating a sub-set of all the data collected, Brown et al. (2016) discovered that 
many texts by or about black women do not have metadata tags,⁶⁶ which means that scholars 
navigating the Internet will not know of their existence. The research team’s ultimate goal is 
to identify all untagged volumes and then make the entire corpus available online. 

In the field of linguistics, a number of big data studies have begun focusing on 
artificial intelligence fields (such as machine learning and natural language processing) to 
detect and/or prevent cyberbullying and suicide. The former, defined as “wilful and repeated 
harm inflicted through the medium of electronic text” (Patchin & Hinduja 2006:152), 
has reached perturbing levels owing to the rapid growth of social networking globally, and 
for this reason several researchers have turned to big data analytics and combined it with 
computational linguistics with a view to creating language models to automatically detect 
cyberbullying content. For instance, in a recent study by van Hee, Jacobs, Emmery, Desmet, 
Lefever et al. (2018), trained linguists employed a fine-grained annotation scheme to 


65 This research team was made up of humanities scholars, social scientists, and data researchers. 


66 Tags are metadata (keywords/terms/snippets of text) about a particular resource. For example, ‘South 
African novelist’ and ‘playwright’ would be tags for Zakes Mda; they describe the author, thus 
providing the browser with useful information. Many websites provide user-friendly tutorials on 
how to create metadata tags. 
analyse 192 085 social media posts from a site called ASKfm. The main aim of the research 
team was to model bullying attacks and reactions from both victims and bystanders so that 
any signals of potential cyberbullying events could be quickly investigated and prevented 
by human moderators.⁶⁷ The researchers’ annotation scheme took into account that 
existing parental control tools, while useful in detecting keywords that reflect profanity or 
insulting words, are unable to perceive covert forms of cyberbullying, particularly when 
these forms do not contain explicit vocabulary. The annotation categories thus encompassed 
expressions related, amongst other things, to threats/blackmail, insults, curses, defamation, 
sexual talk, bystander/victim defense, and encouragement in support of the harasser 
(Van Hee et al. 2018:8-9). The development of fine-grained annotation schemes is crucial, given 
that some forms of cyberbullying such as defamation may be fairly difficult to recognise.⁶⁸ 
Other researchers who have combined big data analytics and computational linguistics 
to detect cyberbullying include Dinakar, Reichart and Lieberman (2011), Spitzberg and 
Gawron (2016), Power, Keane, Nolan and O’Neill (2017), and Lee, Lee, Park and Han 
(2018). These researchers and many others would agree that what makes the automatic 
detection of cyberbullying difficult is the deliberate obfuscation of abusive language, hence 
the need for more sophisticated detection systems. What scholars have found is that abusers 
use various strategies to avoid exposure by replacing a single character in an offensive word 
(cf. Pitsilis, Ramampiaro & Langseth 2018) or by making use of newly invented words 
(Lee et al. 2018:23). 
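The single-character replacement strategy described above can be illustrated with a minimal Python sketch. It is not the method used in any of the studies cited; the tiny lexicon and the function names are hypothetical, and the matching rule (same length, at most one differing character) is a deliberately simple assumption.

```python
import re

# Hypothetical mini-lexicon; a real system would rely on a curated, far larger list.
LEXICON = {"idiot", "loser", "stupid"}


def within_one_substitution(a: str, b: str) -> bool:
    """True if a and b have equal length and differ in at most one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= 1


def flag_tokens(post: str, lexicon=LEXICON):
    """Return (token, matched_term) pairs for exact or one-character-off matches."""
    hits = []
    for token in re.findall(r"\S+", post.lower()):
        token = token.strip(".,!?;:\"'")  # drop surrounding punctuation
        for term in lexicon:
            if within_one_substitution(token, term):
                hits.append((token, term))
                break
    return hits
```

A rule this crude also flags innocent words (for example, "lover" is one character away from "loser"), and it misses newly invented words altogether, which is one reason why more sophisticated systems and human moderators remain necessary.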

As far as assessing risk of suicide is concerned, the linguistic analysis of suicide notes 
is not new and can be traced back to Clues to suicide (1957) edited by Edwin Shneidman and 
Norman Farberow (cf. Desmet & Hoste 2013:6352). Employing discourse analysis for the 
most part, contributors to this book analysed 66 authentic or fabricated suicide notes in an 
attempt to distinguish between real and false notes.⁶⁹ Essentially, early work on suicide risk 
assessment was based on analyses of surface-level elements such as the individual’s choice 
of verbs, adverbs, modals, and auxiliaries (Osgood & Walker 1959; Gleser, Gottschalk & 
Springer 1961; Desmet & Hoste 2013:6352). A number of researchers have made use of 
computational methods to classify suicide notes as genuine or elicited (inauthentic). Jones 


67 Research by van Royen, Poels, Daelemans and Vandebosch (2014) suggests that most online users are 
not averse to automatic monitoring of cyberbullying on condition that their privacy and autonomy 
are guaranteed. 


68 For example, Van Hee et al. (2018:10) point out that while an encouragement in support of the harasser 
such as “I agree we should send her hate” is explicit and thus easily recognisable, one such as “hahaha” 
or “LOL” is not. 


69 For example, research shows that in contrast to simulated notes, real suicide notes tend to be longer. 
They also contain more pronouns as well as more references to people and social phenomena 
(Fernández-Cabana, Ceballos-Espinoza, Mateos, Aeresa Alves-Pérez, Gómez-Reino Rodriguez & 
Garcia-Caballero 2015:147). 

and Bennell (2007), for example, classified suicide notes in terms of structure variables 
such as sentence length and parts of speech as well as in terms of content features such as 
the individual’s instructions and explanation for his/her intention, while Handelman and 
Lester (2007) analysed the semantic content of words in suicide notes using a text analysis 
programme developed by Pennebaker, Francis and Booth (2001). In the last decade, 
researchers have begun experimenting with machine learning techniques in order to classify 
suicide notes (Pestian, Nasrallah, Matykiewicz, Bennett & Leenaars 2010; Yang, Willis, 
De Roeck & Nuseibeh 2012; Cheng, Li, Kwok, Zhu & Yip 2017; Just, Pan, Cherkassky, 
McMakin, Cha, Nock & Brent 2017; Walsh, Ribeiro & Franklin 2017). It appears that 
traditional statistical approaches to predicting suicide attempts in clinical psychology are not 
as accurate as machine learning techniques since they generally employ logistic regression; 
commenting on work carried out by researchers such as Walsh, Ribeiro and Franklin 
(2017) and Just et al. (2017), O’Connor and Kirtley (2018:7) argue that “[r]ecent advances 
in machine learning techniques allow the computation of optimized risk algorithms, from 
hundreds of different individual variable pathways, to suicidal thoughts and behaviour”. 

In South Africa, scholars in the humanities and in the social sciences have embarked 
on exciting research projects through technological and data-driven research. Karli Brittz 
(2018), for example, who works in the Department of Visual Arts at the University of 
Pretoria, explores data artists’ works which reflect the use of big data analytics as well as 
visualisation techniques with a view to making big data more accessible and useful to society. 
What makes Brittz’s (2018) study particularly interesting is that she attempts to reconcile art 
and dataism. 

A review of the scholarly literature reveals that it is in the field of learning analytics 
that South African scholars have begun harnessing the power of big data. Matsebula and 
Makandla (2019), for example, have considered how big data architecture in the context of 
higher education needs to be conceptualised and tailor-made for specific institutions so that 
analysts are able to extract meaningful insights from educational data. In their study of how 
learning analytics is practised in South Africa, Lemmens and Henn (2016:250) offer the 
caveat that this dimension of institutional research should not result in a “data silo”, which 
describes “a situation where fragmented data is collected, analysed and stored on personal or 
distributed systems”. Other scholars who have explored learning analytics in South African 
higher education settings are Jordaan and Van der Merwe (2015), who have reviewed best 
practices for implementing a learning analytics strategy in institutions of higher learning, 
and Prinsloo and Rowe (2015), who call for analysts to be mindful of the ethical issues 
surrounding the use of student data. We do not offer a detailed opinion on learning analytics 


70 Handelman and Lester (2007) classified words in terms of the use of future tense verbs, metaphysical 
references, social references, and negative or positive emotions, to name a few variables. 

from an African perspective “considering that the African continent comprises 54 sovereign 
states, each with its unique regulatory framework, development agenda, information and 
communications technology (ICT) infrastructure, and state of adoption of online learning” 
(Prinsloo 2018:25). 


6.3 Big data as an obstacle/bridge to humanitarian projects 


Big data and increased connectivity allow humanitarian organizations to better 
understand where to target humanitarian assistance. — Helena Kamper 


An increasingly significant online trend is the use of big data in digital humanitarianism, 
which may be defined as “the enacting of social and institutional networks, technologies, and 
practices that enable large, unrestricted numbers of remote and on-the-ground individuals 
to collaborate on humanitarian management through digital technologies” (Burns 2014:52). 
The purposes of digital humanitarianism are myriad, ranging from managing social/natural 
disasters and uncovering truths or untruths to understanding why disasters occur in the 
first place (Barnett 2008:259), all of which are not too far removed from what many 
social scientists (and some humanists)⁷¹ do (cf. Barnett 2008:259). A simple Google search 
using the keywords digital humanitarianism/humanitarians and social sciences/social scientists 
yields an interesting result, namely, that digital humanitarians are calling on social and 
data scientists to assist them in dealing with what Patrick Meier (2015:99) describes as 
“the overflow of information generated during a disaster [and which] can be as paralyzing 
to humanitarian response as the absence of information”. In this regard, a major crisis that 
the big data deluge has created pertains to the flood of misinformation or disinformation 
generated on social media platforms during disasters. Bruce Lindsay (2011:7), for example, 
points out that false, inaccurate or malicious social media posts hinder or delay humanitarian 
response efforts and in some cases create an unsafe environment for both the community 
and first responders. To pre-empt or at least reduce these kinds of consequences, we argue for 
social science scholars to collaborate with data scientists and to employ crowdsourcing with 
a view to verifying or invalidating huge amounts of information generated via social media 
platforms during times of disaster. In the context of disaster management, crowdsourcing 
“is the volunteer-generated, decentralized contribution of [crisis] information online” 
(Harrison & Johnson 2016:17), and several researchers have begun tapping into this kind 
of information to glean valuable and accurate information about disasters (cf. Barton 


71 Linguistic analyses of huge volumes of online posts are widely employed to extract useful information 
during disasters. Sarah Vieweg (2012), for example, carried out a linguistic analysis of verbs in tweets 
with a view to showing how such an analysis may augment situational awareness of mass emergency 
situations. Cresci, Tesconi, Cimino and Dell’Orletta (2015) also used linguistic analysis to detect 
social media messages vital for accurately assessing damage during natural disasters. 

2018). In one interesting study, a team of researchers made use of automatic methods to 
extract useful information from text-based microblogging messages generated when the 
so-called Joplin 2011 tornado struck Joplin, Missouri in the United States with catastrophic 
results (Imran, Elbassuoni, Castillo, Diaz & Meier 2013). What the research has yielded is 
an automatic system that filters out information irrelevant to a given disaster while detecting 
informative messages that will facilitate what digital humanitarians refer to as situational 
awareness of the disaster. Since this 2013 study, many others have appeared, focusing on, 
amongst other things, Ebola outbreaks (Odlum & Yoon 2015; Kim, Jeong, Kim, Kang & 
Song 2016), rockslides (Dammeier, Moore, Hammer, Haslinger & Loew 2016), and floods 
(Lo, Wu, Lin & Hsu 2015). 


6.4 Size revisited 


Big data: Does size matter? — Timandra Harkness (2016:1) 


To return to data size, the questions uppermost in readers’ minds might be these: is it 
necessary for qualitative researchers to collect enormous amounts of data to achieve their 
research objectives? In other words, is the analysis of large datasets theoretically justified? 
Will the demand for big data projects result in the demise of small data studies? Mahrt and 
Scharkow (2013:23) partially answer both questions when they contend that researchers 
need to determine if coding a huge data set has any inherent value since it may have 
limitations in terms of validity and scope (Mahrt & Scharkow 2013:27). We noted in 
Chapters 3 and 4 that a given sample, no matter how large, may not be representative of 
a specific population, making generalisability of results difficult if not impossible. In the 
field of social media research, for example, Manovich (2012:465) laments that “[p]eoples’ 
posts, tweets, uploaded photographs, comments, and other types of online participation 
are not transparent windows into their selves; instead, they are often carefully curated and 
systematically managed”. Big data experts advise researchers to first generate a small data 
sample to glean preliminary insights before collecting large amounts of data (Rojas, Kery, 
Rosenthal & Dey 2017:26). When determining just how much data to collect, researchers 
should always take the (social) context of data into account since the data cannot speak for 
itself (cf. Frické 2015). Throughout this book, we have seen numerous examples of studies 
that rely on the context of the data collected to make sense of the content in that data. For 
media and communications scholar Shani Orgad (2009:34), a fundamental question to be 
asked when doing a qualitative analysis of data taken from the Internet is the following: 
“what does ‘the Internet’ stand for in a particular context, for particular agents”? She rightly 
observes that cyberspace is not a monolithic space; it is “a collection of locations much 
like the real world” (Reips, Buchanan, Krantz & McGraw 2015:141). Thus, for example, 
when Orgad (2005) explored how breast cancer patients communicate in online spaces, she 
initially mapped what she calls “the landscape of breast cancer patients’ communication” 
(Orgad 2009:34) by identifying the spaces or arenas in which these participants engage, 
which may be (online) message boards and (offline) personal diaries. 

To come back to the question of data size, and in the context of social media research, 
Mahrt and Scharkow (2013:20) have noted that smaller-scale analyses of online users’ 
messages and/or behaviour may yield meaningful insights, provided that researchers employ 
sampling, measurement, and analytical procedures that are sound. Kaplan, Chambers and 
Glasgow (2014:342) offer the caveat that we should not be seduced by the notion that huge 
datasets, as opposed to small ones, will yield more reliable and meaningful findings. Kaplan 
et al. (2014) provide several examples from epidemiology, clinical trials, and health service 
research to illustrate that very large datasets could result in sampling biases⁷² and significant 
inferential errors.⁷³ 
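The point that a very large sample can mislead more than a small one lends itself to a short worked example. The Python sketch below uses an entirely hypothetical population of scores (not data from Kaplan et al.) and compares a large sample drawn only from high scorers with a small systematic sample spread across the whole population.

```python
from statistics import mean

# Hypothetical population: 10 000 individuals with scores 1..10 000.
population = list(range(1, 10_001))

# Large but biased sample: only individuals scoring above 4 000 are captured (6 000 cases).
large_biased = [score for score in population if score > 4_000]

# Small systematic sample: every 100th individual, starting with the 50th (100 cases).
small_systematic = population[49::100]

true_mean = mean(population)
error_large = abs(mean(large_biased) - true_mean)
error_small = abs(mean(small_systematic) - true_mean)

print(f"true mean: {true_mean}")
print(f"error of large biased sample (n={len(large_biased)}): {error_large}")
print(f"error of small systematic sample (n={len(small_systematic)}): {error_small}")
```

Under these assumptions the sample of 6 000 misses the population mean by a wide margin, while the sample of 100 comes very close to it, since size does nothing to correct for who is left out.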

Scholars have argued that since size is relative in the world of big data and is not 
linked solely to volume, small data can in fact be quite large in size. We have 
noted, for instance, that in their study of black women’s narratives, Brown et al. (2016) 
collected 800 000 periodicals. This is a large dataset, although some big data analysts would 
argue that while it reflects variety in the sense that it is made up of books, articles, and 
newspapers, it lacks the other fundamental ontological attributes that big data should have, 
namely, velocity and volume. The solution to this problem is to scale small data into data 
infrastructures — that is, “pool ... and link small data in order to create larger datasets” 
(Kitchin & Lauriault 2015:463):⁷⁴ 


Whilst the scaling of small data into data infrastructures does not create big 
data, in the sense that the data still lack velocity and exhaustivity, it does 
make them more big data-like by making them more extensive, relational 
and interconnected, varied, and flexible. This enables two effects to occur. 
First, it opens scaled small data to new epistemologies and, in particular, 
to new forms of big data analytics ... [and second], it facilitates small data 
being conjoined with big data to produce more complex, inter-related and 
wide-ranging data infrastructures (Kitchin & Lauriault 2015:470). 


72 For example, Kaplan et al. (2014:343) describe a nurses’ health study of just over 48 000 
postmenopausal women that did not take into account the atypical nature of the sample under 
investigation. Researchers in the study erroneously concluded that hormone replacement therapy 
significantly reduces coronary heart disease. 


73 See Chapter 3 in which we referred to Leinweber’s (2007) warning that large datasets can be 
manipulated to yield questionable correlations. 

74 Bollier (2010:12) points out that “[it] is generally safer to use larger [datasets] from multiple sources” 
to avoid the risk of drawing the wrong conclusions. 

In their discussion of the value that small data still holds, boyd and Crawford 
(2012:670) describe the work of Veinot (2007), who studied only one individual, a blue- 
collar worker at a hydroelectric power plant, with a view to studying workplace information 
practices. They conclude that small data may be superior to big data in some instances: 
“[Veinot’s] work tells a story that could not be discovered by farming millions of Facebook 
or Twitter accounts”. These sentiments are echoed by Bollier (2010:14) in The promise and 
peril of big data when he argues that more is not necessarily better; he quotes Stefaan Verhulst 
(currently co-founder and chief of research at the Governance Laboratory at New York 
University), who observes that “[people] quite often fail to understand the data points that 
they actually need, and so they just collect everything or just embrace Big Data. In many 
cases, less is actually more.” 

In further defence of the use of smaller data inputs, it is worthwhile noting that very 
big datasets may generate “dirty” or “biased” data (Kitchin & Lauriault 2015:466) which 
in turn will impact negatively on validity. Although big data proponents may be critical of 
small data for failing to reflect volume or velocity, “[s]mall data studies ... seek to mine gold 
from working a narrow seam, whereas big data studies seek to extract nuggets through open- 
pit mining, scooping up and sieving huge tracts of land” (Kitchin & Lauriault 2015:466). 


The place of qualitative data analysis software 
(QDAS) programmes in a big data world 


Comprehension of [QDAS] as a facilitator and data management system, and not an 
alternative for data immersion and analysis, will serve the qualitative researcher well. 


— Diane Cope (2014:323) 


Although qualitative data analysis software (QDAS) programmes are fairly well established, 
little is known about how qualitative researchers employ them to analyse their datasets (cf. 
Woods, Paulus, Atkins & Macklin 2016:597), while even less information is available when 
it comes to the use of these programmes in the arena of big data. We thus consider the place 
of QDAS tools in the era of big data in this chapter, not only discussing recent developments 
in their design, but also critically appraising their usefulness to ‘big’ qualitative researchers 
by reviewing their advantages as well as their pitfalls. 


7.1 Software programmes and the qualitative researcher 


... [w]hile qualitative data analysis software ... will not do the analysis for the 
researcher, it can make the analytical process more flexible, transparent and ultimately 
more trustworthy. — Florian Kaefer, Juliet Roper and Paresha Sinha (2015:1) 


According to political scientist and software inventor Stuart Shulman (2014), scholars 
consider themselves to be either purists, who prefer to focus exclusively on rich and in-depth 
interpretation of data, pluralists, who favour experimental, mixed methods, or positivists, 
who rely on scientific quantitative methods to achieve validity, reliability, and objectivity. A 
large number of qualitative researchers position themselves in between purists and pluralists 
(Evers 2018:66), and are seeking to employ qualitative data analysis software (QDAS) tools 
that support the analytic process that takes place in the mind (Evers 2018:65). Selecting an 
appropriate software tool is fairly challenging, given that many software tools such as Tableau, 
Statistical Package for the Social Sciences (SPSS), and the free, open-source software referred 
to as R, are designed to support quantitative rather than qualitative research. Furthermore, 
the myth persists that different QDAS programmes reflect different methodological stances: 
ATLAS.ti, for example, appears to promote hermeneutics and grounded theory (Friese 2014), 
while MAXQDA is regarded as supporting mixed methods research (Guetterman, Creswell 
& Kuckartz 2015). The assumption that QDAS tools tend to champion grounded theory 
in particular is challenged in the literature: “the perception of a grounded theory bias within 
[QDAS] is countered not only by a reminder that the functions offered by [QDAS] are 
employed within other methodological frameworks ... but also by a reminder that grounded 
theory is an ambiguous methodology” (Tummons 2014:5). Tummons (2014:5) notes that 
“theoretical vagueness” in qualitative research has helped fuel this erroneous assumption. 
He offers the caveat that researchers should not regard software programmes as driving the 
research process: “[QDAS] can provide the tools, but it cannot do the analysis” (Tummons 
2014:5). Indeed, it is no coincidence that Tummons (2014) qualifies the term QDAS with 
the adjective computer-assisted (CAQDAS) to emphasise the fact that software tools can only 
aid the researcher to carry out a variety of tasks such as data collection, storage, coding, and 
data visualisation (cf. Friese 2016:2). 


7.2 Two ways of thinking about QDAS 


...we too often rely on our intuition and routine thinking for big decisions when we 
should actually slow down and become more analytical. — John Reeves (2014:2) 


By now it should be clear that we are not advocating for big data analysis to replace a 
close, interpretative analysis of information. Instead, we are proposing that scholars merge 
computational and interpretative instruments as several scholars have done (Drucker 2011; 
Friese 2016; van Dijck 2016; Taylor, Gregory & Donaldson 2017). Van Dijck (2016:13) 
contends that such a merging is not new, and makes use of two analogies to prove her point: 
the magnetic resonance imaging (MRI) scanner has not entirely replaced X-rays, computed 
tomography (CT) scanners, and ultrasound, since all these instruments complement as well as 
overcome one another’s limitations by offering different and unique diagnostics. In addition, 
interpretation of each of these devices’ images does not occur automatically, but is the result 
of many years’ commitment to interpretation and fine-tuning of their features (van Dijck 
2016:13). Similarly, the microscope will not supplant the telescope as they are not competing 
instruments (van Dijck 2016:13),⁷⁵ and both require human interpretation of what they 
reveal to the naked eye. In the same way, combining interpretative and digital methods “does 
not mean that ... [we] ‘surrender’ to a new methodological paradigm” (van Dijck 2016:17) 
in either the humanities or social sciences. 

Friese (2016:35) offers researchers working in a qualitative paradigm useful advice 
when she argues that they should consider thinking about the development of QDAS tools 


75 Indeed, Charles Perrault (1693) observed that “with the help of the telescope and microscope it was 
possible to discover the immeasurable space in the largest and the smallest bodies, which gives an 
almost infinite extent to science, which engages with them”. 

in terms of Nobel Prize winner Daniel Kahneman’s (2011) distinction between fast and slow 
thinking. In terms of Kahneman’s (2011:23) model drawn from behavioural psychology, 
“System 1” describes the brain’s ability to make rapid, intuitive, and automatic decisions, 
while “System 2” or slow thinking encompasses making choices and decisions at a more 
leisurely pace. Friese (2016:36) links these two systems to innovations in QDAS, asserting 
that in the past, software tools tended to accommodate only System 2, since they demanded 
a considerable amount of time and energy on the part of the researcher to read through data 
and then to manually interpret that data. She goes on to argue that in an era of big data, 
what is needed are tools that support System 1 as well, and indeed, many are now designed 
to do so, reflecting features that allow for rapid, automatic coding⁷⁶ and for the creation of 
word clouds (see Figure 7.1), word trees, tables, and a variety of graphical displays. 
ATLAS.ti, NVivo, MAXQDA, and QDA Miner are all examples of software tools that can store 
fairly large (but not enormous) datasets,⁷⁷ whether text-, audio- or video-based, facilitate 
coding of data, generate word clouds and word trees, for instance, and create visualisations 
of data. 
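To make the "fast" side of such tools more concrete, the following Python sketch shows the kind of word-frequency count on which a word cloud is typically built. It does not reproduce the internals of ATLAS.ti, NVivo, MAXQDA or QDA Miner; the stop-word list and the function name are illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word list only; QDAS tools ship their own, much longer lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "that"}


def word_frequencies(text: str, top_n: int = 5):
    """Count non-stop-words; a word cloud sizes each word according to such a count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(top_n)
```

Calling `word_frequencies("Big data and small data are both data.")` would rank "data" first with a count of three. The count says nothing about what the word means in context, which is where the researcher's slower, interpretative work begins.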


76 When it comes to automatic coding, software tools have been designed in such a way that they allow 
for the coding of strings of words and for the identification of themes in the data collected. 


77 As noted in the previous chapter, qualitative researchers may work with very large datasets that big 
data analysts do not regard as reflecting all the Vs, namely, volume, velocity, variety, and veracity. 
Nevertheless, qualitative scholars tend to create large datasets by collecting and then linking small 
datasets (Kitchin & Lauriault 2015:463) that several QDAS tools are able to accommodate. 

Figure 7.1: An example of a word cloud from South Africa’s Life Esidimeni Arbitration Hearings, 
24 and 25 January 2018⁷⁸ 


We are not suggesting that System 2 should be subordinated to System 1, however. 
Friese (2016) describes her own challenges when doing a research study which clearly 
illustrates that both fast and slow thinking are essential to the qualitative researcher, since 
an analysis based on the former type of thinking cannot be guaranteed to be accurate. 


78 Source: https://www.youtube.com/watch?v=bsoO8pkkt6o. A scandal and national tragedy hit South 
Africa on 1 February, 2017 when Health Ombudsman Prof. Malegapuru Makgoba released a report 
detailing the deaths of mentally ill patients at psychiatric facilities located in Gauteng. A total of 
144 individuals died, and neglect and starvation were listed as some of the causes of death. It is 
possible that more people died, but the exact number is not yet known. The hearings took place 
between September 2017 and March 2018, and were presided over by retired Deputy Chief Justice, 
Dikgang Moseneke. 

Friese (2016:37-38) employed System 1 tools to create a word cloud from a survey she 
conducted based on open-ended questions, with a view to determining which words 
occurred with specific themes. Although she made use of auto-coding, during which the 
software automatically sorted the words initially identified by theme, she realised that 
she nevertheless could not entirely trust the coding as she herself had to employ System 2 
thinking to manually check the auto-coded segments for accuracy. The situation becomes 
more complicated when one takes into account that QDAS tools (such as NVivo) have a 
tendency to either take a very long time to code data or to crash under huge amounts of data. 
Best practices to resolve or pre-empt these problems include storing data outside the given 
QDAS programme and cleaning a dirty database. With respect to data storage, Kaefer et al. 
(2015:7) suggest that frequent back-ups be made using Dropbox, Microsoft’s OneDrive or 
Google Cloud Platform, to name a few, although Li, Gai, Qiu, Qiu and Zhao (2016:103) 
warn that serious security issues still surround these cloud-based storage spaces in the sense 
that cloud service operators may access sensitive data.⁷⁹ When it comes to data cleaning, 
the literature at this stage remains heavily skewed in favour of quantitative data cleaning 
(Rousseeuw & Leroy 2005; Hellerstein 2008; Aggarwal 2003), but Chu and Ilyas (2016) 
provide taxonomies of qualitative error detection and data repair techniques for scholars 
working with big data. The first taxonomy pertains to detecting so-called surface anomalies 
by determining what, how, and where these anomalies occur. In this respect and in a detailed 
article, Chu and Ilyas (2016) provide scholars who have minimal training in software use 
with a user-friendly tutorial on how to answer the three questions posed. In terms of the 
second taxonomy, these researchers also provide a detailed explanation of what, how, and 
where to repair an erroneous database. Both error detection and error repair techniques may 
be either automatic or guided by humans.⁸⁰ Using these cleaning techniques is essential to 
enhance the quality of data as discussed in Chapter 3 (cf. Cai & Zhu 2015; Fan, Xiao & 
Yan 2015). 
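One simple cleaning step, detecting records that have been entered more than once, can be sketched in Python as follows. This is not the taxonomy proposed by Chu and Ilyas (2016); it is a minimal illustration of an automatic detection pass whose output a human would still need to review, and the function names are hypothetical.

```python
from collections import defaultdict


def normalise(record: str) -> str:
    """Lower-case and collapse whitespace so that trivial variants compare as equal."""
    return " ".join(record.lower().split())


def find_duplicate_groups(records):
    """Group the positions of records that are identical after normalisation."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[normalise(record)].append(index)
    # Only groups with more than one member are candidate duplicates.
    return [indices for indices in groups.values() if len(indices) > 1]
```

The sketch only catches near-identical strings. Deciding whether two differently worded responses are in fact the same record, or whether a repeated response is a genuine second answer, remains a judgement for the researcher.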

Another design challenge for QDAS developers which qualitative researchers need 
to be aware of lies in the field of sentiment analysis which involves coding opinions in a piece 
of text as either positive, negative or neutral. Friese (2016:38-39) remarks that what is highly 
problematic is that QDAS may sometimes incorrectly classify opinions. A major hurdle in 
this regard involves sarcasm and irony in social media, two phenomena which sentiment 
coding cannot easily detect (Farias 2017). Thus, for example, a detection engine may not 
classify “Of course I’d love to spend the last Rand I have on an expensive TV” as being sarcastic. 


79 These researchers have proposed a unique cryptography method that effectively blocks cloud operators 
from accessing users’ data on cloud servers. Essentially, the method entails dividing sensitive data into 
two encrypted components and then storing them on different cloud servers. 


80 If data has been duplicated by mistake, for example, a machine will not be able to detect this as 
accurately as a human being can (Chen, Zobel, Zhang & Verspoor 2016). 

Indeed, “[sarcasm] ... is usually ignored in social media analysis, because it is considered too 
tricky to handle” (Maynard & Greenwood 2014:4238).⁸¹ 
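Why sarcasm trips up sentiment coding becomes clearer with a toy example. The Python sketch below implements a naive word-counting classifier; it is not how any particular QDAS programme works, and the two word lists are hypothetical and far smaller than a real sentiment lexicon.

```python
import re

# Hypothetical toy lexicon; real sentiment lexicons contain thousands of entries.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "sad"}


def lexicon_sentiment(text: str) -> str:
    """Label a text by counting positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Applied to the sentence about spending one's last Rand on an expensive TV, this classifier sees only the word "love" and returns "positive", since nothing in a word count registers the speaker's evident reluctance.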


7.3 QDAS and blended reading 


Both close and distant readings reveal not just what is in a data set, but how that 
data might be enacted. — Yanni Loukissas (2016:4) 


Besides the above challenges, the voluminous amounts of data to be managed remain 
problematic and no software programme exists that, at the click of a mouse, has the ability 
to churn out accurate analyses of the information contained in the data. A number of 
researchers have suggested that qualitative researchers employ what they call blended 
reading, an approach initially conceived by Lemke (2014), to analyse large data sets 
(Lemke, Niekler, Schaal & Wiedemann 2015; Hammond, Brooke & Hirst 2016; Stulpe & 
Lemke 2016; Loukissas 2016). Blended reading combines distant and close reading, thus 
“integrating quantitative and qualitative analyses of complex data progressively” (Lemke et al. 2015:7) to provide a 
multifaceted view of data. A number of scholars have found a combination of close and 
distant reading of texts to be fruitful in film history (Hoyt 2014), digital research (Wills 
2016), and literary-historical scholarship (Taylor et al. 2017), to name a few areas. Here, we 
briefly describe studies by Rauscher (2014) and Reiberg (2016) to give some indication of 
how distant and close reading may be combined. 
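In outline, blended reading can be thought of as a two-step workflow: a distant pass over the whole corpus narrows it down, and the researcher then reads the remaining texts closely. The Python sketch below illustrates only the first, automated step, under the assumption that relevance is judged by a simple keyword count; the function names, keyword set and threshold are hypothetical and are not taken from the studies described here.

```python
import re
from collections import Counter


def keyword_hits(text: str, keywords) -> int:
    """Distant-reading step: count how often any of the keywords occurs in a text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(counts[keyword] for keyword in keywords)


def select_for_close_reading(corpus: dict, keywords, min_hits: int = 2):
    """Return ids of texts with at least min_hits keyword occurrences, densest first."""
    scored = {doc_id: keyword_hits(text, keywords) for doc_id, text in corpus.items()}
    selected = [doc_id for doc_id, hits in scored.items() if hits >= min_hits]
    return sorted(selected, key=lambda doc_id: (-scored[doc_id], doc_id))
```

The threshold is a research decision, not a technical one: set too high, it excludes texts that discuss the topic obliquely; set too low, it leaves more material than can be read closely.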

In his exploration of literary representations of cities in crime novels, Rauscher 
(2014) combined a qualitative close reading of fairly large quantities of literary texts with 
computer-assisted QDAS analysis, corpus linguistics, and distant reading in an iterative 
research process, moving constantly between close and distant reading. Rauscher’s (2014) 
study is a useful one in the sense that it shows how crime novels have the potential to 
serve as a rich “data basis for urban sociology and interdisciplinary research questions about 
the distinctiveness of cities” (Rauscher 2014:68). Significantly, Rauscher (2014) exploited 
thick description to analyse patterns of discourse in and across 240 novels, a method that 
is becoming increasingly important to enrich (big) data analytics (Felt 2016) and which 
we discussed in Chapter 3. Using the term thick analysis, but with acknowledgement 
to Clifford Geertz (1973) for having coined the term thick description, Evers (2015:1) 
contends that although time consuming, thick analysis is useful for “[enhancing] the depth 
and breadth of data analysis by creatively combining several analysis methods, allowing for 
a more comprehensive analysis” of data. Evers (2015:7) also asserts that QDAS programmes 


81 Having said this, researchers who adopt a computational linguistic approach to sentiment analysis 
have actively worked on identifying irony and sarcasm (Reyes, Rosso & Veale 2013; Sulis, Farias, 
Rosso, Patti & Ruffo 2016; van Hee, Lefever & Hoste 2018). 

that offer hyperlinking, visualisation, and annotation tools, for example, will help augment 
thick description. A number of qualitative software programmes allow for thick description 
including NVivo (cf. Mwangi 2017), MAXQDA (cf. Goldenberg, Darbes & Stephenson 
2017), and ATLAS.ti (Ang, Embi & Yunus 2016). What is particularly advantageous 
about these programmes is that if a team of researchers is working on a particular study, 
each researcher has the opportunity to code the data independently and to explain his/her 
reasoning through thick description of the data in a given programme's memo function.⁸² 
As Brokensha and Conradie (2016:9) observe, the memo function “not only [allows] 
researchers to immerse themselves in the data, thus avoiding the so-called ‘tactile-digital 
divide’ (Gilbert 2002, 216), but also affords greater transparency because researchers can 
share insights, challenges, and doubts (Tracy 2010, 842)” they may harbour about their 
individual interpretations (cf. Tummons 2014:8). 

Reiberg (2016) explored the first five years of Germany's Internet policy with a 
view to gaining insights into how this policy was socially constructed by political actors in 
debates reflected in two large corpora, namely, parliamentary minutes (from 1996 to 1998) 
and approximately 37 000 articles published in five major newspapers between 1994 and 
1998. Since the corpora were so large, making it impossible to conduct a close reading of 
the content in them, Reiberg (2016) initially carried out distant reading of 129 articles 
from one newspaper to determine the frequency of terms related to the word ‘Internet’ 
and its synonyms/synonymous terms. Next, he selected content that pertained strictly to 
Internet policy, further reducing the corpus through a list of keywords that were synonyms 
of Internet policies or reflected the names of organisations that focused on Internet policy. 
This list was then employed to both select and process the remaining articles, and the corpus 
was analysed in terms of the names of Internet policies. Identifying only those Internet 
policies that were enacted between 1996 and 1998, Reiberg (2016) searched for terms that 
were synonymous with these policies before moving on to the second phase of his research. 
In this phase, Reiberg (2016) carried out both distant and close reading in the sense that he 
employed a distanced look to identify broad themes in the corpora before using qualitative 
content analysis to conduct a close reading of the reasoning used by political actors to 
generate two types of statements — either societal problem statements or demands for state 
intervention in the context of the domain of Internet policy. This close reading allowed 
Reiberg (2016) to generate a codebook to describe the features of each type of statement. In 


82 Memos, along with graphical representations and search results, for example, are referred to as 
secondary documents in the literature on qualitative data analysis software. The practice of generating 
both primary and secondary documents is referred to as system closure. Managing both kinds of 
documentation in a given software programme “makes it a simple task for the researcher to search 
and then code his or her ongoing analytical or explanatory material using the same coding structure 
as has been used for the primary data” (Tummons 2014:7). 

a third and final phase, each type of statement was analysed to identify their similarities and 
determine political actors’ shared understandings of the new Internet policy domain. What 
Reiberg (2016:10) ultimately concluded was that the term “information society”⁸³ was one 
frequently used by political actors to construct a shared understanding of Internet policy. It 
is notable that Reiberg made use of MAXQDA to carry out his study, since it is a software 
programme that allows the researcher to carry out System 1 and System 2 analyses of coded 
data. In addition, and as Friese (2016:40) puts it, such a software programme “[supports] 
qualitative data analysis rather than offering to [analyse] the data”. 
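
The first, distant-reading phase of a study such as Reiberg's can be illustrated with a minimal Python sketch. It is not Reiberg's actual procedure: the three sample articles and the keyword list are hypothetical, and a real study would derive its keywords from the corpus itself. The sketch reduces a corpus to the texts that mention a keyword and then counts keyword frequencies across the reduced corpus.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for newspaper articles.
articles = [
    "The Internet is changing how parliament debates data protection.",
    "Football results dominated the weekend news.",
    "A new multimedia law would regulate online services and the Internet.",
]

# Hypothetical keyword list (an assumption made for illustration only).
keywords = {"internet", "online", "multimedia"}


def tokenize(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_counts(text, keywords):
    """Count how often each keyword occurs in a single text."""
    counts = Counter(tokenize(text))
    return {k: counts[k] for k in keywords if counts[k] > 0}


# Corpus reduction: keep only the articles that mention at least one keyword.
selected = [a for a in articles if keyword_counts(a, keywords)]

# Distant reading: keyword frequencies across the reduced corpus.
totals = Counter()
for article in selected:
    totals.update(keyword_counts(article, keywords))

print(len(selected), dict(totals))
```

The articles that survive this filter are the ones a researcher would then read closely and code by hand.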

Studies by Rauscher (2014), Reiberg (2016), and others offer a number of 
important lessons, one being that a close reading of large datasets is not an impossible task 
when combined with computational analysis (cf. Hammond et al. 2016). However, a more 
important lesson is that close reading is by no means obsolete in a big data world, contrary to 
what Jockers (2013:7) maintains in the context of literary history when he writes that “[close] 
reading is not only impractical as a means of evidence gathering ... but big data render it 
totally inappropriate as a method”.⁸⁴ Hammond et al. (2016:74) are of the view that rather 
than perceiving close and distant reading to be parallel structures for interpretation as Jockers 
(2013) does, scholars should “use computational analysis to test, probe and enliven human 
close readings”. That is, they should adopt a hybrid approach which reflects a feedback loop 
in which distant and close readings are in a continual and reciprocal dialogue. This is the 
approach that Rauscher (2014) adopted when he studied how cities are depicted in crime 
novels: he refers, for example, to employing a feedback loop in which interpretations based 
on close and distant readings constantly challenge one another. Rauscher (2014:96) observes 
that for in-depth interpretation of data, “distant reading and visualization alone are not 
sufficient”, while a “particular and limited close reading can (and should) be enhanced and 
enlarged to a wider context” through distant reading. 

Ultimately, qualitative researchers need to decide for themselves whether or not 
they would like to work with much larger datasets. Friese (2016:44) calls on scholars to at 
least consider using CAQDAS when analysing their data, paraphrasing Hitchcock (2014) 
when she states that “CAQDAS and ‘Big Data’ absolutely need to have a conversation. The 
subject of that conversation should be on how (or whether) to integrate close reading and 
small data, and distant reading and large data”. Qualitative software tools are in a sense at 
an embryonic stage, and much work still needs to be done to refine them so that they are 


83 This term “refers to an inevitable, positive change societies in general are undergoing at different 
speeds” (Reiberg 2016:10). 

84 Interestingly, and in the next sentence, Jockers (2013:7) confusingly remarks that “[this] is not to 
imply that scholars have been wholly unsuccessful in employing close reading to the study of literary 
history”. He cites the works of Ian Watt and Erich Auerbach as excellent examples of close readings 
of texts. 

able to accommodate increasingly large amounts of information. Nevertheless, these tools 
may open doors scholars never imagined existed. Van Dijck (2016:17) puts it eloquently 
when he observes that “[as] academic guardians of the arts, culture, language, heritage, and 
the traditions of humanities thinking, we will have to engage in multifarious ways with 
the interrelatedness of digital technology in all kinds of cultural practices”. This statement 
applies equally well to scholars working in the social sciences (cf. Smith 2017). 


7.4 QDAS, qualitative content analysis, and big data 


In the era of “big data”, the methodological technique of content analysis can be the 
most powerful tool in the researcher’s kit. — Steven Stemler (2015:1) 


Although content analysis as a method reflects a number of limitations related, amongst 
other things, to issues of validity and intercoder reliability⁸⁵ as well as to the problem of 
being time-consuming, several scholars recommend that QDAS tools be combined with 
qualitative content analysis since this method allows scholars “to systematically transform a 
large amount of text into a highly organised and concise summary of key results” (Erlingsson 
& Brysiewicz 2017:94; cf. Elo & Kyngäs 2008:113; Lewis, Zamith & Hermida 2013:34; 
Renz, Carrington & Badger 2018:824). Indeed, Kaefer, Roper and Sinha (2015:1) argue 
that when combined with QDAS tools, this method improves both the transparency and 
trustworthiness of the qualitative research process in the context of big data. However, there 
are a number of guidelines that researchers should follow when combining QDAS and 
qualitative content analysis (QCA) which we briefly address here. 
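
The intercoder reliability concern mentioned above can be made concrete with a short Python sketch. The text does not name a specific measure; Cohen's kappa is used here as one common choice, and the category labels and coding decisions are hypothetical. Kappa compares the agreement two coders actually reach with the agreement that would be expected by chance.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Return Cohen's kappa for two coders' labels of the same items."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(codes_a)
    # Observed agreement: share of items given the same code by both coders.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: based on how often each coder used each category.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical coding decisions for four text segments.
coder_a = ["problem", "problem", "demand", "demand"]
coder_b = ["problem", "demand", "demand", "demand"]

kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

A value close to 1 indicates strong agreement beyond chance, while a value near 0 suggests that the coders agree no more often than chance alone would predict.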

Unlike quantitative content analysis, which emphasises objective, quantitative, and 
systematic descriptions of messages (Berelson 1952:18), QCA focuses instead on context, 
aiming for “the subjective interpretation of the content of text data” (Hsieh & Shannon 
2005:1278) which avoids “rash quantification” (Mayring 2000:1). Kaefer et al. (2015:1- 
2) suggest that scholars make a distinction between qualitative data analysis (QDA) and 
QCA because the former generates interpretations based on an entire body of texts, while 
the latter reflects a (usually smaller) quantitative component. Two major criticisms levelled 
at QCA in the context of big data research are — paradoxically — that the use of software 
has a distancing effect in the sense that it removes the researcher from the data, while being 
too deeply immersed in the data prevents the researcher from seeing the bigger picture, as 
it were (cf. Bazeley 2007; Ryan 2009). We noted in the previous section that a number of 


85 Validity may be called into question if the coding process is neither consistent nor coherent (cf. 
Renz et al. 2018:825), while establishing reliability becomes a problem the moment coders working 
independently of each other are unable to re-code the data and classify category membership in the 
same way (cf. Bolognesi, Pilgram & van den Heerik 2017:1988). 

scholars have resolved these problems by making use of blended reading (e.g., Lemke et al. 
2015; Hammond et al. 2016; Loukissas 2016). Kaefer et al. (2015:1) suggest that a similar 
approach be used when they refer to “[a] multi-level approach to QDAS-assisted analysis” 
because it enables scholars to achieve both “closeness for familiarity ... [and] distance for 
abstraction and synthesis” (Bazeley 2007:8): 


[U]sually the researcher closest to the data brings in the awareness of the 
complexities in the data, while other researchers bring in the abstraction 
necessarily for synthesis at the final stage of the analysis. However, they 
do this without being familiar with the coding process and the data. 
Thus abstraction is imposed on to the data. In this case as the coding and 
analysis processes are transparent, the focus on familiarity can be switched 
to abstraction, followed by synthesis as some members will still remain 
distant from the data, while being fully aware of how the data was coded 
(Kaefer et al. 2015:17). 


A significant pitfall of QCA pertains to the tendency that scholars relatively unfamiliar 
with content analysis might have to over-code the data they have collected, particularly 
because software makes it so easy to code that data. Marshall (2002:61) summarises this 
dilemma succinctly when she asks “If ‘there is always something to be found’, then even if one 
has reached ‘theoretical saturation’ where no new themes emerge, has one finished coding?” 
Solutions to over-coding at the analysis stage are to have both “well-defined questions and a 
clear, step-by-step procedure” in place before the analysis commences (Kaefer et al. 2015:17). 


7.5 Beyond traditional databases 


...the sheer volume and variety of [primary research material] make it difficult to 
access through the traditional approaches. - Dmytro Karamshuk, Frances Shaw, 
Julie Brown and Nishanth Sastry (2017:33) 


In conclusion, and with respect to data management and storage, software tools such as 
ATLAS.ti, NVivo, MAXQDA, and QDA Miner can manage and store relatively large 
amounts of data in, for example, text-, audio-, and video-based formats. As far as analysis 
is concerned, “while qualitative data analysis software ... will not do the analysis for the 
researcher, it can make the analytical process more flexible, transparent and ultimately 
more trustworthy” (Kaefer et al. 2015:1). 

In this chapter as well as in the previous one, we have argued that appropriate data 
size in a big data world depends very much on (a) whether researchers regard data as big only 
if it conforms to all four Vs or on (b) whether, as is the case for traditional humanists and 
social scientists, the argument is that volume is not the only attribute that makes big data 
“big”. Nevertheless, we cannot deny that QDAS is unable to handle and store big data as 
defined by researchers in category (a), since it “pushes at the limits of traditional databases 
as tables of rows and columns, and requires new ways of querying and leveraging data for 
analysis” (Amoore & Piotukh 2015:4). It is for this reason that we take a closer and far more 
technical look at how the big data ecosystem is geared towards managing huge datasets in 
the next chapter. 


The nitty-gritty: Big data infrastructure 


Infrastructure is the cornerstone of Big Data architecture. — Eileen McNulty (2014:1) 


Thus far, we have interrogated big data from a number of perspectives, but without exploring 
the nitty-gritty of big data — the nuts-and-bolts of this phenomenon, as it were. Its emergence 
reflects rapid advancements in the development of new technologies often coupled with 
“overlapping technology waves” (Demchenko, Grosso, Laat & Membery 2013:1), and in 
this environment, it is vital that humanists and social scientists interested in examining huge 
amounts of data familiarise themselves with the various big data architectural components 
which make up the big data ecosystem. 


8.1 Managing the unmanageable 


Computer systems have become a vital part of the modern research environment, 
supporting all aspects of the research lifecycle. — Savas Parastatidis (2009:165) 


As discussed in previous chapters, big data refers to a collection of data that is growing at 
such a rapid rate that traditional database technologies cannot accommodate the collection, 
hence the need for newer types of technology (Provost & Fawcett 2013; Vaisman & Zimányi 
2014). Big data therefore differs from small data and traditional analytics. As we have already 
noted, big data is generally associated with three features or Vs, namely, volume, variety, and 
velocity (Storey & Song 2017). These three features are used in the literature to highlight 
the fact that big data involves more than just massive amounts of data that are generated, 
stored, and analysed. Big data also takes on many different disparate and incompatible 
data formats, which include structured, semi-structured or unstructured data sources and 
which are created in (near) real-time (Davenport, Barth & Bean 2012; Jagadish, Gehrke, 
Labrinidis, Papakonstantinou, Patel, Ramakrishnan & Shahabi 2014; Wamba, Akter, 
Edwards, Chopin & Gnanzou 2015).⁸⁶ A major challenge inherent in big data research is 
that datasets need to be cleaned and processed before they can be integrated into a big data 


86 Near real-time (or NRT as it is also referred to) describes data integration that takes place on a 
regular basis. Thus, data may be collected on a daily basis or every few minutes or hours: “[t]he time 
taken between when data arrives and is processed is very small, close to real time” (Chandio, Tziritas 
& Xu 2015:9). Real-time data, on the other hand, refers to data that “arrives and is processed in a 
continuous manner, which enables real-time analysis” (Chandio et al. 2015:9). 



system. In response to this challenge, IBM introduced a fourth V, veracity, to address 
the element of uncertainty that arises when it comes to data quality. Touched on in Chapter 
3, veracity entails managing data quality in an adequate fashion (cf. Buhl, Roglinger, Moser 
& Heidemann 2013:67), and simply refers to “the truthfulness of the data” (Powers Dirette 
2016:2). 

To use big data as an information asset, innovative data processing technologies are 
required to generate insights and facilitate decision-making. Scholars unfamiliar with big 
data may assume that technologies have existed since the dawn of the computer age to store 
and process massive datasets. What makes big data unique, however, is the nature of the 
overwhelming data flows which in turn drive the need for fundamental changes to be made to 
computing architectures and data processing mechanisms. Jim Gray, a data software pioneer, 
called this driving force the fourth paradigm as far back as 2007, and pointed out that this 
new paradigm, or “data-intensive science” (Bell, Hay & Szalay 2009:1298) constitutes the 
only way to cope with the management and visualisation of huge datasets.⁸⁷ These datasets 
also demand the repetitive activity of storing data over a long period of time. Examples 
include a logger writing millions of visits to a webpage into a weblog, or a cellphone database 
storing the details of each call from all handsets every 15 seconds (Jacobs 2009). 

Not only must a big data system handle large amounts of continuously generated 
data, but it must also provide a stable and scalable environment for storing, analysing, and 
mining the given datasets (Hu, Wen, Chua & Li 2014). This provides an interesting 
challenge to researchers who rely on traditional data storage systems, since these systems are 
used to store structured data and are repeatedly queried by a relational database management 
system (RDBMS).⁸⁹ With the arrival of semi-structured and unstructured data, these 
systems are stretched far beyond their original system design, since they are now required 
to handle ad hoc queries, as well as a single batch query which could take several hours to 
complete. It is widely accepted that traditional RDBMSes and structured query language 


87 Jim Gray’s (2007) presentation can be found at 
http://microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt. 
In a presentation made to the Computer Science and Telecommunications 
Board in California on 11 January 2007, Gray remarked that “[t]he techniques and technologies for 
... data-intensive science are so different that it is worth distinguishing data-intensive science from 
computational science as a new, fourth paradigm for scientific exploration” 
(http://itre.cis.upenn.edu/myl/JimGrayOnE-Science.pdf). Sadly, it is the very last presentation 
posted by Gray as he was lost at sea on 28 January 2007. 


88 Scalability is the ability of a network, software or process to manage huge amounts of data. 


89 A relational database management system (RDBMS) is a tool data analysts utilise to create, 
update, and manage the structured data they collect which is then stored in tables, very much like 
Excel spreadsheets. 

(SQL)⁹⁰ cannot be used in this way owing to their relational architecture and adherence 
to the ACID⁹¹ (atomicity, consistency, isolation and durability) properties of an RDBMS 
(Krishnan 2013; Hu et al. 2014). Furthermore, traditional RDBMSes are incapable of 
managing the volume and velocity of the new types of data generated by sensor networks, 
machine loggers, clickstream analysers and e-commerce platforms, since relational data is 
stored in pre-defined schemas (Jukić, Sharma, Nestorov & Jukić 2015). 
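
For readers unfamiliar with relational databases, the following minimal Python sketch illustrates two of the properties discussed above: a pre-defined schema and atomic transactions. It uses SQLite through Python's standard `sqlite3` module purely for convenience; the table name, column names, and values are hypothetical, and SQLite stands in for the far larger RDBMSes the literature has in mind.

```python
import sqlite3

# An in-memory database; the schema must be declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls ("
    "id INTEGER PRIMARY KEY, handset TEXT NOT NULL, seconds INTEGER NOT NULL)"
)

# A successful transaction: the connection's context manager commits on exit.
with conn:
    conn.execute("INSERT INTO calls (handset, seconds) VALUES (?, ?)", ("A", 120))

# Atomicity: the second insert violates NOT NULL, so the whole transaction,
# including the valid first insert, is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO calls (handset, seconds) VALUES (?, ?)", ("B", 30))
        conn.execute("INSERT INTO calls (handset, seconds) VALUES (?, ?)", ("C", None))
except sqlite3.IntegrityError:
    pass

# Schema rigidity: a field that was not declared in advance is rejected.
try:
    conn.execute(
        "INSERT INTO calls (handset, seconds, location) VALUES (?, ?, ?)",
        ("D", 10, "unknown"),
    )
    schema_accepts_new_field = True
except sqlite3.OperationalError:
    schema_accepts_new_field = False

rows = conn.execute("SELECT handset, seconds FROM calls ORDER BY id").fetchall()
print(rows, schema_accepts_new_field)
```

The guarantees shown here make transactions reliable, but they also help explain why data whose structure changes from record to record fits awkwardly into such systems.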

As a result, nearly all major Internet companies such as Oracle, IBM, Microsoft, 
Google, Amazon, Facebook, and Yahoo, to name a few, have been compelled to initiate 
big data projects. These projects have yielded ground-breaking and emerging fourth 
paradigm technologies to manage as well as analyse and visualise big datasets. Some of these 
technologies include, but are not limited to, Apache Hadoop, MapReduce, Hive, and Spark. 
In subsequent sections, we shed some light on how the fourth paradigm may play a role in 
the humanities and social sciences by providing an overview of innovative and state-of-the- 
art big data technologies and their associated applications. 


8.2 Big data systems 


... big data presents unique systems engineering and architectural challenges. 
— Edmon Begoli and James Horey (2012:215). 


The architectural layout of big data systems is often complex and consists of several layers 
which include applications, data tools, information and communications technology (ICT) 
infrastructure, and service level agreements (SLA). This complexity can be bewildering 
to researchers who do not have a computer science background. In order to gain a better 
understanding of these layers, it is sometimes easier to conceptually view big data systems as 
a culmination of technology, people, data, and processes instead of a complex layered system 
often depicted in the literature (Kim, Jeong & Kim 2014). In this way, a more traditional 
value-chain systems-engineering approach can be followed which divides a big data system 
into several consecutive phases, which include data generation, acquisition, storage, and 
processing (Hu et al. 2014). However, to appreciate these phases, the applications, data tools, 
and ICT infrastructure often associated with big data systems need to be introduced first. 
The application layer, the first entry point into a big data system, typically employs a 
programming model to implement various data analysis functions, which include querying, 


90 SQL is a language that enables data analysts to interact with relational databases. What this means 
in practical terms is that a data analyst is able to access, insert, update or delete data to and from a 
relational database. 


91 “... ACID ... is a set of properties that guarantee[s] that transactions are processed 
reliably” (Miller 2013:145). 

statistical analysis or recommendation engines (Zhang, Yang, Chen & Li 2018:72). Potential 
applications will then tap into the layer to gain insight into and value from the given big 
datasets. Domains that have invested heavily in big data systems include healthcare (Wang 
& Hajli 2017; Wang, Kung & Byrd 2018), the public sector (Desousa & Jacob 2017; 
Maciejewski 2017), the retail sector (Li & Wang 2017), and the financial industry (Seddon 
& Currie 2018). 

In the social sciences, big data systems are currently being utilised in disciplines 
such as geography (Gao, Li, Li, Janowicz & Zhang 2017), cultural sociology (Bail 2014), 
political science (Colleoni et al. 2014), and history (Graham, Milligan & Weingart 2016). 
In the context of humanities scholarship, the use of big data systems is becoming popular 
in education/learning analytics (Kellen, Recktenwald & Burr 2013), musicology (Pugin 
2015) and in the digital archiving of literary texts or art (Matusiak, Meng, Barczyk & Shih 
2015). (See Chapter 6 in which the applications of big data to social science and humanities 
scholarship have been discussed in greater detail.) 

The next layer is the computing layer, which consists of data tools (Hu et al. 
2014). The toolset includes tools to integrate, manage, and make data accessible for the 
programming model or application layer. These tools include, among others, state-of-the- 
art and innovative distributed file systems (cf. Howard et al. 1988; Ghemawat, Gobioff & 
Leung 2003), NoSQL (‘not only SQL’) databases (cf. Cattell 2011), MapReduce (cf. Dean 
& Ghemawat 2008), YARN (cf. Kulkarni & Khandewal 2014), Hadoop (cf. White 2015), 
and Spark (cf. Reyes-Ortiz, Oneto & Anguita 2015). 

With the application and computing layers conceptually defined, an infrastructure layer 
is required which consists of several ICT resources. This layer may be physical infrastructure 
on-site, or it may use cloud computing that enables virtualisation. What is important to note 
is that this layer is responsible for data storage, networking, and computation functions. 
Direct attached storage (DAS), Network attached storage (NAS) and Storage area network 
(SAN) devices are organised into a network architecture of storage systems. The storage 
systems can either be disk oriented (DAS), file oriented (NAS) or block oriented (SAN) 
(Hu et al. 2014). With the three layers conceptually defined, a more in-depth discussion is 
required about the phases in big data systems which include data generation, acquisition, 
storage, and processing. 


8.3 Data generation 


You may ... think of [velocity] as the frequency of data generation or the frequency of 
data delivery. — Philip Russom (2011:7) 


Data generation is a highly diverse phase often producing complex datasets generated by 
distributed data sources. These data sources include databanks, webpages, social media, 
sensors, and mobile data. Essentially, data is distributed across a number of data sources; 
currently, the main big data sources are businesses or enterprises, computer networks that 
include the Internet and the Internet of Things (IoT),⁹² and scientific applications. 

The internal data of businesses or enterprises mainly consists of operational data 
sources such as human resource data, production data, inventory data, sales data, and 
so forth. Most of these data sources are structured and historical in nature and focus on 
capturing day-to-day activities which are managed by RDBMSes (Ponniah 2010). In order 
to turn this data into strategic information, IT departments have worked hand-in-hand 
with the business sector over the past few decades with a view to improving profitability 
through better decision-making. However, to continuously increase the value of strategic 
information, real-time analysis is required. For example, the retail corporation Walmart both 
processes and stores approximately 2.5 petabytes of customer data per hour, which is the 
equivalent of more than one million transactions every hour (Kitchin 2014a:71, cf. Marks 
2016:194). Similarly, Facebook loads more than 60 terabytes of new data and stores more 
than 15 petabytes of information on a daily basis (Khan, Naqvi, Alam & Rizvi 2015:298). 
According to the US-based IT research and advisory company Gartner, approximately 8.4 
billion devices were connected to the Internet of Things in 2017; it is predicted that this 
number will increase to 20.4 billion within the next two years⁹³ (cf. Li, Da Xu & Zhao 
2018). In terms of scientific data, applications are generating increasingly large datasets that 
rely on big data analytics for insight creation. For example, the European Organization for 
Nuclear Research (CERN) recorded storing more than 200 petabytes of data in its tape 
libraries on 29 June 2017⁹⁴ (Gaillard & Pandolfi 2017:1). 

However, to be viewed as a big data generation source, large volumes of data 
should be generated at a very high velocity. Furthermore, the data formats should include 
semi-structured or unstructured formats instead of the traditional structured format. 
Structured data refers to data entities organised and stored in a structured manner such 


92 The term Internet of Things is not easy to define (Wortmann & Flüchter 2015:221), since it is an 
evolving paradigm. One definition is that it “is used as an umbrella keyword for covering various 
aspects related to the extension of the Internet and the Web into the physical realm, by means of the 
widespread deployment of spatially distributed devices” (Miorandi, Sicari, De Pellegrini & Chlamtac 
2012:1497). It has also been defined as “a paradigm where everyday objects can be equipped with 
identifying, sensing, networking and processing capabilities that will allow them to communicate 
with one another and with other devices and services over the Internet to accomplish some objective” 
(Whitmore, Agarwal & Da Xu 2015:261). 


93 https://www.gartner.com/newsroom/id/3598917. 


94 This data emanated from CERN’s Large Hadron Collider (LHC). The LHC’s computers store 
approximately 15 petabytes of data annually which, according to Harford (2014:14), translates into 
“15,000 years’ worth of [someone's] favourite music”. 

as XML documents⁹⁵ or database tables in a relational database system (White 2015). 
Like structured data, semi-structured data has a schema or structure, but is less organised or 
structured (Katal, Wazid & Goudar 2013; Kim, Trimi & Chung 2014). A spreadsheet from 
Microsoft Office or Google Sheets is a good example of a semi-structured source. Examples 
of unstructured data include social media messages, webpages, photographs, and audio files 
(Minelli, Chambers & Dhiraj 2013; White 2015). 


8.4 Data acquisition 


Some data sources ... can produce staggering amounts of raw data. Much of this data 
is of no interest ... — Alexandros Labrinidis and Hosagrahar Jagadish (2012:2032) 


Currently a number of data sources are capable of producing overwhelming amounts of 
raw data, but the data acquired may be useless for a number of reasons. For example, if 
a researcher's aim is to continually analyse real-time data, it may be unusable if it is not 
captured on time. In this regard, real-time processing software remains in its infancy (Cai & 
Zhu 2015). Other challenges which we have noted in this book are those related to improper 
data representation (Chen, Mao & Liu 2014:175) and “dirty” data, which is duplicate 
data, data that contains errors or data that is incomplete/outdated (Liu, Wang, Li & Gao 
2017:644). The acquisition of big datasets reflects three phases which are data collection, 
data transmission or transportation, and data pre-processing (Chen, Mao, Zhang & Leung 
2014). Three of the most popular approaches for acquiring big data include log files, sensors, 
and web crawlers (Hu et al. 2014). Log files are one of the most widely adopted approaches 
for collecting big data since they are generated by source systems to record activities taking 
place on database systems and web servers. Sensor data, on the other hand, is the output of 
digital devices after detecting an input in the physical environment. These sensors include 
accelerometers, photo sensors or smart-grid sensors. Sensors are an integral component 
of the Internet of Things environment (Chen et al. 2014). Network data, which includes 
web pages, is acquired through a web crawler which is often applied in search engines or 
web caching. 
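
To give a sense of what collecting data from log files involves, the sketch below parses a few web-server log lines in Python and counts visits per page. The log lines are invented, and the layout assumed here is the widely used Common Log Format; real servers may be configured to write a different layout, in which case the pattern would need to be adapted.

```python
import re
from collections import Counter

# Pattern for the Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Invented sample entries, including one malformed line.
sample_log = """\
192.0.2.1 - - [10/Oct/2017:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.2 - - [10/Oct/2017:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 1024
192.0.2.1 - - [10/Oct/2017:13:56:01 +0000] "GET /index.html HTTP/1.1" 304 -
this line is malformed
"""


def parse_log(text):
    """Turn raw log text into a list of dictionaries, skipping malformed lines."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records


records = parse_log(sample_log)
visits_per_page = Counter(record["path"] for record in records)
print(visits_per_page.most_common())
```

Even this small example shows why acquired data is not immediately usable: lines that do not fit the expected layout have to be detected and set aside before any analysis can begin.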

When data is acquired, the raw data is transferred to a data storage facility for 
processing and analysis. These data storage facilities, commonly referred to as data centres, 
consist of physical media such as fibre optic cables and an interconnected network that 
offers high throughput and low latency. Owing to the diverse variety of data sources which 


95 Similar to HTML, in that it also contains markup symbols that are employed to describe the content 
of a file (or page), XML (Extensible Markup Language) reflects rules that enable documents to be 
encoded in human- and machine-readable formats. (The markup symbols in XML are unlimited, 
which is not the case when it comes to HTML.) 

often contain noise (meaningless information), redundancies, and anomalies, data should 
be pre-processed to avoid transferring and storing data that cannot be used. Pre-processing 
techniques of big datasets include data integration, data cleansing, and redundancy 
elimination (Hu et al. 2014). These techniques are well established in the field of data 
warehousing, and even though they are referred to variously in the literature as extraction, 
transformation, and loading (ETL) (Santoso 2017:95), all ultimately deal with improving 
data quality (Kimball & Caserta 2004), which we have already addressed in this book. 
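
The cleansing and redundancy-elimination steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the record layout (dictionaries with an `id` and a `value` field) and the function name `preprocess` are assumptions made for the example and do not come from the sources cited.

```python
# Illustrative pre-processing: data cleansing and redundancy elimination.
# The record layout ("id" and "value" fields) is an assumption for this sketch.

def preprocess(records):
    """Drop incomplete records, normalise text, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Data cleansing: skip records with a missing id or an empty value.
        if record.get("id") is None or record.get("value") in (None, ""):
            continue
        # Normalise whitespace and letter case so equivalent values compare equal.
        value = " ".join(str(record["value"]).split()).lower()
        key = (record["id"], value)
        # Redundancy elimination: keep only the first copy of each record.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"id": record["id"], "value": value})
    return cleaned


raw = [
    {"id": 1, "value": "  Sensor OK "},
    {"id": 1, "value": "sensor ok"},   # duplicate once normalised
    {"id": 2, "value": ""},            # noise: empty value
    {"id": 3, "value": "Sensor FAIL"},
]
print(preprocess(raw))
```

Only two of the four raw records survive, which is the point of pre-processing before transfer: data that cannot be used is never stored.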


8.5 Data storage 


Big data storage requirements are complex and thus needs a holistic approach to 
mitigate its challenges. — Rajeev Agrawal and Christopher Nyamful (2016:1) 


Once data is acquired and pre-processed, it must be prepared for storage, analysis, and 
value extraction by a big data platform. We have already noted that this poses a significant 
dilemma for any traditional data storage system, since traditional RDBMSes are not capable 
of handling very large amounts of data (Patel 2017:125). Coupled to this is the challenge 
inherent in most datasets being either semi-structured or unstructured in nature. For these 
reasons, a data storage infrastructure should be flexible and reliable, providing a scalable 
access interface for data processing and queuing (Hu et al. 2014). 

Storage technologies, such as direct attached storage (DAS), network attached storage 
(NAS), and storage area network (SAN), are responsible for storing the data collected (Hu et 
al. 2014). However, an additional challenge comes into play since the data stored should also 
be organised in an efficient way so that it can be processed effectively. Big data management 
frameworks facilitate this process and comprise three layers: distributed file systems, 
NoSQL databases, and programming models. 


8.5.1 Distributed file systems 


File systems are the foundation of any computing system and are employed to control 
how data is stored and retrieved. Considerable research, mainly driven by large Internet 
companies, has been devoted to improving these systems for the era of big data. Both the 
Google File System (GFS) and Facebook’s Haystack are well-known examples of distributed 
file systems developed in an attempt to meet data processing needs. The GFS was the result 
of the so-called “Big Files” effort by Google co-founders Larry Page and Sergey Brin (Lydia 
& Swarup 2015:391), and was designed for system-to-system interaction and not for 
user-to-system interaction (Gemayel 2016:67). The GFS utilises inexpensive commodity servers, 
grouped into computer clusters, to provide a scalable distributed file system for large 
distributed data-intensive applications (Ghemawat et al. 2003). The GFS is thus regarded 
as a popular distributed file system (Lee, Lee, Choi, Chung & Moon 2011), with many 
GFS clusters currently deployed world-wide. Released for the first time in 2010, a new 
version of the Google File System (the GFS2) is code named Colossus, and promises to 
provide a considerable performance advantage over the first version. Facebook, by contrast, 
designed Haystack to store and process the massive amounts of photographs in their big 
data application (Beaver, Kumar, Li, Sobel & Vajgel 2010). The number of photographs 
uploaded to Facebook is staggering; at present about 14.58 million images are uploaded each 
hour, which translates into 350 million every day (Aslam 2018:3). 


8.5.2 NoSQL databases 


NoSQL (or ‘no SQL’) databases together with distributed file systems have become the 
de facto standard to store and manage big datasets (Cattell 2011). NoSQL databases can 
be roughly grouped into four types, namely, key-value stores, column-oriented stores, 
document databases, and graph databases. Each of these types of NoSQL database organises 
data according to a different data model. 

A key-value NoSQL database, for instance, has a simple data model where data is 
stored as a key-value pair and each key is unique (Hu et al. 2014). Well-known examples of 
this type of database include Voldemort (developed by LinkedIn.com) and Dynamo, used 
by Amazon’s e-business platform to obtain data-driven recommendations from major data 
(Provost & Fawcett 2013; Chen et al. 2014). 
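
The key-value data model can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: the class `KeyValueStore` and the key naming scheme are hypothetical and do not reflect the interfaces of Voldemort or Dynamo.

```python
class KeyValueStore:
    """A toy in-memory key-value store; each key maps to exactly one value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Writing to an existing key overwrites the old value, so keys stay unique.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42:cart", ["book", "pen"])
store.put("user:42:cart", ["book"])   # same key, so the value is replaced
print(store.get("user:42:cart"))
```

The store knows nothing about the structure of the value; it can only look it up by key. That simplicity is what makes the model easy to distribute across many machines.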

Column-oriented databases store and process data by columns rather than by rows, 
as is the case in relational databases, and were inspired by Google’s Bigtable (Hu et al. 2014). 
Popular examples of these kinds of databases include Cassandra (Lakshman & Malik 2010; 
Haseeb & Puttun 2017) and HBase (Perkins, Redmond & Wilson 2018). Cassandra was 
developed by Avinash Lakshman (one of the authors of Amazon’s Dynamo) and Prashant 
Malik at Facebook to power the Facebook inbox search feature and was open-sourced in 
2008. Cassandra⁹⁶ is a fault-tolerant, decentralised database, meaning that it has no single 
point of failure, and excels at real-time transactions and data analytics. HBase is an 
open-source clone of Google’s Bigtable that runs on top of Hadoop and leverages the 
distributed data storage capabilities of the Hadoop Distributed File System. The capabilities of 
HBase are elaborated upon when we introduce Hadoop in Section 8.7. 
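
The difference between storing data by rows and by columns can be made concrete with a small sketch. The student records below are invented for illustration, and the function `to_columns` is a hypothetical helper rather than part of Cassandra or HBase.

```python
# Row-oriented layout: one record per entity (as in a relational table).
rows = [
    {"student": "A", "year": 2016, "mark": 61},
    {"student": "B", "year": 2016, "mark": 74},
    {"student": "C", "year": 2017, "mark": 58},
]


def to_columns(row_records):
    """Pivot a list of row dictionaries into one list of values per column."""
    columns = {}
    for record in row_records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: all values of one attribute are stored together.
columns = to_columns(rows)

# An aggregate over a single attribute now reads one list only,
# without touching the other attributes of each record.
average_mark = sum(columns["mark"]) / len(columns["mark"])
print(columns["mark"], round(average_mark, 2))
```

Analytical queries typically read a few attributes across many records, which is why this layout suits them.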

Unlike key-value stores, document databases are able to support more complex 
data structures because the data is stored as documents and represented in JSON format 
(Krishnan 2013:86). JSON or JavaScript Object Notation is a recognised data-interchange 
format that is easy for researchers to read and write (Izquierdo & Cabot 2016:52). What makes 
the JSON format quite appealing for researchers is that it is computer language independent 


96 http://cassandra.apache.org/. 
and thus employed among a wide range of computer programming languages. Standard 
examples of document stores include MongoDB, Couchbase, CouchDB, as well as Riak 
(Krishnan 2013:86). According to DB-Engines,⁹⁷ which ranks database management 
systems, MongoDB is currently the most popular NoSQL document database among 
users (cf. Columbo & Ferrari 2015:146), the main reason being that the database does 
not require a predefined schema, allowing for flexible storage strategies when changing and 
updating documents. 
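
A brief sketch may help to show what schema-less JSON documents look like in practice. It uses only Python's standard `json` module; the documents, field names, and the `find` helper are invented for illustration and are not the query interface of MongoDB or any other document store.

```python
import json

# Two documents in the same collection with different fields:
# no predefined schema forces them to share a structure.
documents = [
    {"id": 1, "title": "Survey 2017", "tags": ["education", "survey"]},
    {"id": 2, "title": "Interview", "speaker": {"name": "Thandi", "role": "lecturer"}},
]

# JSON is plain text, so it can be exchanged between programming languages.
serialised = json.dumps(documents)
restored = json.loads(serialised)


def find(collection, field, value):
    """Return every document whose top-level field equals the given value."""
    return [doc for doc in collection if doc.get(field) == value]


print(find(restored, "title", "Interview"))
```

Adding a new field to one document requires no change to the others, which is the flexibility referred to above.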

The final type of NoSQL database is called a graph database. In graph theory, 
a graph is a mathematical structure that consists of nodes (vertices) connected by edges 
(arcs), while a node is usually represented as a dot and edges by a link in diagrams (Celko 
2014:208). Graph databases are known for their ability to represent links and relationships 
between relevant data nodes using graph theory. However, graph databases are not suitable 
for computations and aggregations, and are therefore predominately used to store and 
represent data as a graph (Celko 2014:208). The graph database is also viewed as one of the 
most complex NoSQL database types and has developed as a consequence of the rapid growth 
in social media data (Krishnan 2013:97). Well-known examples of this type of database 
include Neo4j (Jordan 2014:11) and OrientDB (Küçükkeçeci & Yazici 2018:35). Neo4j⁹⁸ is 
a widely-used graph database according to DB-Engines;⁹⁹ it is implemented in Java and uses 
Cypher Query Language to query the data.¹⁰⁰ All objects in Neo4j are stored as either an 
edge, a node or an attribute.¹⁰¹ Unlike Neo4j, OrientDB¹⁰² is an open-source NoSQL 
multi-model database that combines graph theory with key-value as well as document- and 
object-oriented models into a single database. In other words, OrientDB is a graph database where 
every edge and node is a document. This allows for increased functionality and flexibility, 
making OrientDB a second-generation NoSQL database. It is envisaged that multi-model 
databases will soon replace traditional RDBMSes as the preferred big data store, due to their 
ability to accommodate multi-format datasets of very high volume (Assay 2015). 
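
The node, edge, and attribute model can be sketched with ordinary Python data structures. The example below is illustrative only: the identifiers and helper functions are hypothetical, and real graph databases such as Neo4j are queried through their own languages rather than in this way.

```python
# Nodes carry attributes; edges are (source, relationship, target) triples.
nodes = {
    "john": {"label": "Person", "name": "John"},
    "mary": {"label": "Person", "name": "Mary"},
    "baroque": {"label": "Genre", "name": "Baroque"},
}
edges = [
    ("john", "likes", "baroque"),
    ("mary", "likes", "baroque"),
    ("mary", "knows", "john"),
]


def neighbours(node_id, relationship):
    """Follow outgoing edges of one relationship type from a node."""
    return [target for source, rel, target in edges
            if source == node_id and rel == relationship]


def who(relationship, target_id):
    """Find every node that has the given relationship to the target."""
    return sorted(source for source, rel, target in edges
                  if rel == relationship and target == target_id)


print(neighbours("john", "likes"))   # what John likes
print(who("likes", "baroque"))       # who likes Baroque
```

Both questions are answered by following links rather than by joining tables, which is what distinguishes the graph model from the relational one.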


97  https://db-engines.com/en/ranking. 
98  http://neo4j.com/. 
99 Ibid. 


100 Cypher Query Language is a declarative, SQL-inspired language employed by researchers to visually 
describe patterns in data using circles for nodes and lines for relationships. 


101 Nodes represent entities such as people, accounts, or businesses, and are very much related to a 
record in a relational database. Edges are the lines that connect the nodes and represent relationships, 
while attributes constitute the information about the nodes. In ‘John likes Baroque music’, for example, 
‘John’ and ‘Baroque’ are the nodes, ‘likes’ is the relationship between the two nodes, and ‘music’ is the 
attribute or property. 


102  http://orientdb.com/. 
8.5.3 Programming models 


As noted earlier, the application layer makes use of the computing layer as the bridge between 
the application and infrastructure layer. Programming models are known for their ability to 
link the underlying hardware (infrastructure layer) with the software as well as for their tools 
designed not only to integrate and manage data, but also to make it accessible (application 
layer). In other words, a programming model offers NoSQL databases and distributed 
file systems the functionality for querying and analysing big data sets. The programming 
model is therefore a critical component in any big data architecture. Prior to the big data 
era, traditional programming models such as OpenMP (Dagum & Menon 1998) and 
MPI¹⁰³ (Walker & Dongarra 1996) were used as parallel models. However, these models 
differ from the model required to implement parallel computations on GFS, for example. 
What is required is a generic process model to process and store big data sets. Some of the 
most important programming models include MapReduce, Dryad, Pregel, GraphLab, S4, 
and Storm (Hu et al. 2014). It should be pointed out that a programming model is not a 
programming language and is designed to be used by programmers, rather than business 
users. In addition, a programming model exists independently of a programming language. 


8.6 Data analysis 


The massive amounts of high-dimensional data bring both opportunities and new 
challenges to data analysis. — Jianqing Fan, Fang Han and Han Liu (2014:293) 


Data analysis is arguably the most important stage of any big data value system since its 
goal is to extract value from the big data sets. This value can then be used to improve 
decision-making in an organisation or to reduce organisational inefficiencies. In the context of an 
academic environment, big data analysis may be employed to predict, for instance, future 
performance and examine patterns of student performance over time, since large quantities 
of longitudinal student data can be stored (Daniel 2015). Data visualisation (see Chapter 3), 
statistical analysis, and data mining are currently successfully employed in several big data 
applications for these purposes (Hu et al. 2014). Some of these applications include text 
mining, web mining, multimedia analytics, and structured data analytics. 

Text mining, which is also referred to as text analytics, describes the techniques 
exploited to extract meaningful information from unstructured textual data such as emails, 
blogs, social media content, webpages, online forms, documents or call centre logs (He, Zha 
& Li 2013). The techniques include computational linguistics, statistical analysis as well as 
machine learning, and the goal is to identify models, trends, and patterns from textual data 
that may be useful to the researcher (Sabherwal & Becerra-Fernandez 2011:89-90). 


103 Message passing interface. 
Some of the major applications of text mining include information extraction, text 
summarisation, question answering (QA), and sentiment analysis or opinion mining (Gandomi 
& Haider 2015). Information extraction employs information retrieval (IR) systems to 
extract relevant facts from documents. In other words, it extracts structured data from 
unstructured text. Text summarisation uses algorithms to produce tags for a document by 
analysing words, sentences, and phrases in a document. These tags are then employed to 
produce a summary of a document which could be an email, blog, or news article, for 
example (Kim et al. 2014). Question answering systems make use of NLP techniques¹⁰⁴ 
to provide answers to questions that humans pose. Examples of this type of system include 
Apple’s Siri and IBM’s Watson. Sentiment analysis is used to determine whether a text, or part 
of it, is subjective or not and, if subjective, whether it expresses a positive, negative or neutral 
view (Liu 2015). Sentiment analysis may be performed at the document level (Zhang, Zeng, 
Li, Wang & Zuo 2009), sentence level (Riloff & Wiebe 2003; Appel, Chiclana, Carter & 
Fujita 2016) or even at aspect or entity level (Popescu & Etzioni 2007). 
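
Sentence-level sentiment analysis can be illustrated with a deliberately simple lexicon-based sketch. The word lists below are invented for the example, and the approach is far cruder than the methods in the studies cited; in particular, it treats objective and neutral sentences alike rather than first testing for subjectivity.

```python
# Tiny illustrative lexicons; real systems use far larger, validated resources.
POSITIVE = {"good", "great", "excellent", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "confusing"}


def sentence_sentiment(sentence):
    """Classify one sentence as positive, negative, or neutral by word counts."""
    words = [word.strip(".,!?;:").lower() for word in sentence.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for text in ["The lecture was great and helpful.",
             "The slides were confusing.",
             "The class starts at nine."]:
    print(text, "->", sentence_sentiment(text))
```

Applying the same function to a whole document, or to the sentences that mention a particular entity, gives a rough analogue of document-level and entity-level analysis respectively.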

Whereas text mining focuses on the process of extracting information from 
unstructured data, web mining aims to retrieve and extract information from online web 
data, which could include text, HTML (Hypertext Markup Language), hyperlinks or even 
multimedia data (Kim et al. 2014). Web mining makes use of techniques similar to those used 
in text mining, such as IR and natural language processing (NLP). Web mining can be 
categorised into three areas of interest: web content mining, web structure mining, and 
web usage mining (Kosala & Blockeel 2000). Web content mining is utilised to analyse 
the content of web pages to discover relationships among documents or to analyse the 
text content itself. Web structure mining, by contrast, examines how web documents are 
structured and determines the hierarchy of the underlying hyperlinks. This is particularly 
useful in uncovering relationships between a web site and similar web sites. Web usage 
mining, also known as clickstream analysis, examines web server logs to expose surfers’ 
behaviour and patterns, the ultimate goal being to uncover paths customers follow through 
a company’s website. By analysing these paths, companies are able to identify web pages that 
are infrequently visited, or reveal broken hyperlinks on web pages (Sabherwal & Becerra- 
Fernandez 2011:92). Clickstream analyses are widely employed by academic researchers 
too. In one study, for example, a team of researchers used clickstream analysis to glean 
insights into which attributes of online social networks tend to attract and retain their 
participants (Schneider, Feldmann, Krishnamurthy & Willinger 2009). In another useful 
study, scholars analysed, among other things, the clickstream patterns of tertiary students 


104 Natural language processing (NLP) entails giving computers the ability to process human language. 
NLP techniques allow computers to analyse and understand text, and include lexical acquisition, 
word sense disambiguation, part-of-speech (POS) tagging, probabilistic context-free grammars, and 
probabilistic parsing (Manning & Schütze 1999). 
majoring in science with a view to developing better video-assisted learning practices for 
teachers (Giannakos, Chorianopoulos & Chrisochoides 2015). 
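
Web usage mining, as described above, starts from server log entries. The following sketch reconstructs the path each visitor followed and flags rarely visited pages. The log format (pairs of session identifier and page) and all function names are assumptions made for the example; real web server logs contain many more fields.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed log entries in time order: (session id, page).
log = [
    ("s1", "/home"), ("s1", "/courses"), ("s1", "/enrol"),
    ("s2", "/home"), ("s2", "/courses"),
    ("s3", "/home"), ("s3", "/contact"),
]


def paths_by_session(entries):
    """Rebuild the sequence of pages each session followed."""
    paths = defaultdict(list)
    for session, page in entries:
        paths[session].append(page)
    return dict(paths)


def page_visits(entries):
    """Count how often each page was requested."""
    return Counter(page for _, page in entries)


def infrequent_pages(entries, max_visits=1):
    """List pages visited at most max_visits times."""
    return sorted(page for page, count in page_visits(entries).items()
                  if count <= max_visits)


print(paths_by_session(log))
print(infrequent_pages(log))
```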

Multimedia analytics examines multimedia data such as images/photos, videos, 
and audio. Mining multimedia data is generally seen as highly complex, and requires more 
computational power than mining numeric or textual data (Kim et al. 2014). The main aim 
of multimedia analytics is to extract interesting knowledge from multimedia and to gain 
insights into the semantics as captured in the data (Hu et al. 2014). Significant multimedia 
analytics research focuses on multimedia summaries and multimedia annotation, to name 
just two areas of interest. When it comes to the former, for example, Bian, Yang, Zhang and 
Chua (2015) have generated multimedia summaries of social events in a given microblog 
stream. This kind of summarisation is useful for a number of reasons. First, it affords 
researchers the opportunity to obtain a preliminary overview of the data at their disposal 
before an exhaustive analysis is made. Second, it “allow[s] for ... targeted access to data, 
and ... [builds] the basis for visualization techniques” (Blank, Henrich & Kufer 2016:67). 
With respect to annotation, a team of researchers has shown how user-generated comments 
(UGCs) on YouTube may be used for what they call “multimedia annotation verification” of 
videos (Bajaj, Kavidayal, Srivastava, Akhtar & Kumaraguru 2016:53). They have discovered 
that without accurate verification, YouTube users are unable to successfully retrieve specific 
videos they may be searching for. What makes this particular study useful is that while 
researchers have focused on verifying the credibility of textual content generated by online 
users (Eidenbenz 2012; Nguyen, Yan & Thai 2013; Hocevar, Flanagin & Metzger 2014), few 
have focused on detecting misleading metadata as it relates to videos (cf. Bajaj et al. 2016). 

Structured data analytics entails the analysis of large quantities of structured data 
gathered by business or scientific applications. As mentioned earlier, structured data sources 
are managed by RDBMSes and data warehouses are often used to integrate all the data 
sources (Inmon 2005). In terms of taxonomy, data analytics can be achieved according 
to (1) descriptive, (2) predictive, and (3) prescriptive analytics. Examples of these types 
of analysis include using regression to find a trend in historical data, predicting future 
probabilities and trends, and supporting decision-making to improve efficacy. Technologies 
used to perform data analytics include online analytical processing (OLAP), data 
visualisation, data mining, and statistical analysis (Ponniah 2010). 
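
The step from descriptive to predictive analytics can be illustrated with an ordinary least-squares trend line. The pass rates below are invented for the example, and `fit_line` is a hypothetical helper written from the standard formulas; it is a sketch, not a substitute for a statistical package.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented historical data: yearly pass rates (per cent) for one course.
years = [2014, 2015, 2016, 2017]
pass_rates = [60.0, 62.0, 64.0, 66.0]

# Descriptive: the slope summarises the trend in the historical data.
slope, intercept = fit_line(years, pass_rates)

# Predictive: extend the same line one year beyond the data.
forecast_2018 = slope * 2018 + intercept
print(slope, forecast_2018)
```

A prescriptive step would go further and use such a forecast to choose between alternative courses of action.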

A number of scholars working in the field of education/learning analytics have 
applied or combined the three different kinds of analytics to study their unique contexts 
(cf. Daniel 2014). Some scholars have used descriptive analytics to predict and improve 
students’ success (Dietz-Uhler & Hurn 2013), for instance, while others have employed 
predictive analytics to track at-risk students in an attempt to pre-empt course failure (Milne, 
Jeffrey, Suddaby & Higgins 2012; Zacharis 2015). Prescriptive analytics utilises both 
descriptive and predictive analytics to “[help] institutions of higher education assess their 
current situation and make informed choices on alternative ... [courses] based on valid and 
consistent predictions” (Daniel 2015:915). 


8.7 The Hadoop ecosystem 


[Hadoop] is a flagship technology which became the center of gravity for an entire 
ecosystem. — Fullestop¹⁰⁵ 


During the last few years, the computing infrastructure used to process and store big data 
has changed significantly. This includes NoSQL databases (Cattell 2011), Hadoop-related 
databases (White 2015), Spark data processing engines (Zaharia et al. 2010), computer 
clusters, in-memory databases, and massively parallel-processing (MPP) databases,¹⁰⁶ to 
name a few. Regarded as one of the most widely adopted frameworks in the world and 
the cornerstone of modern big data systems (Chen et al. 2014; Mavridis & Karatza 2017), 
Hadoop was introduced in 2007 as an open-source implementation of Google’s MapReduce 
programming model (White 2015), but has grown into a web of projects consisting of several 
open-source software components.¹⁰⁷ Currently, Hadoop provides scalable, distributed, and 
parallelised computing on clusters of inexpensive servers for data collection, storage as well as 
processing (Eckerson 2011; Rahman & Iverson 2015). This complex system which comprises 
several related projects, tools, and technologies, is often called the Hadoop ecosystem, with 
some tools making the writing task easier or orchestrating more complex tasks. 


8.7.1 Hadoop’s core components 


According to Hadoop’s website,¹⁰⁸ the “Hadoop project” consists of three main components, 
which are the Hadoop Distributed File System (HDFS), MapReduce, and YARN (or “Yet 
Another Resource Negotiator”). The three base components form the basis of a Hadoop 
ecosystem; however, the ecosystem may consist of many other technologies as well, such as 
a NoSQL database (HBase), a data warehousing system (Hive), a platform to manipulate 
the data (Pig), a tool to efficiently transfer bulk data (Sqoop), or a tool for machine learning 
(Mahout). These technologies complement one another and should not be viewed 
as separate components. Each of the three main components is explored below and this is 
followed by a discussion of what constitutes a Hadoop software stack. 


105 https://www.fullestop.com/blog/a-peek-into-the-hadoops-ecosystem/. 


106 MPP is a parallel processing hardware option where multiple processes work on different parts of a 
programme. 


107 See section 8.5.3 for an explanation of programming models. 
The HDFS 


The Hadoop Distributed File System or HDFS was designed to store large amounts of data 
across multiple nodes of commodity hardware in an HDFS cluster (White 2015). The HDFS 
cluster usually consists of (1) a single NameNode that manages the file system’s metadata, 
and (2) a collection of DataNodes that store the actual data (Hu et al. 2014). Since the 
HDFS is file-based, it does not require a data model (as in the case of an RDBMS) to save or 
process the data and can store files in any format. Once a file is uploaded onto the HDFS, 
the file is divided into blocks. The blocks are then distributed between computers within 
the HDFS cluster and are duplicated to store multiple copies of each block on multiple 
computers within the HDFS cluster (White 2015). With this ability, HDFS provides the 
perfect environment for storing structured, semi-structured and unstructured data, while 
simultaneously enabling parallel processing via MapReduce applications (Watson 2014). 
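
The way a file is divided into blocks and copied across DataNodes can be simulated in a few lines. This is a sketch under stated assumptions: the block size, the node names, and the round-robin placement rule are chosen for illustration only, and the placement policy of a real HDFS cluster is more elaborate.

```python
def split_into_blocks(data, block_size):
    """Divide a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to several distinct DataNodes (simple round-robin)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [datanodes[(block + r) % len(datanodes)]
                            for r in range(replication)]
    return placement


# A 10-byte "file" and 4-byte blocks keep the example readable.
blocks = split_into_blocks(b"x" * 10, block_size=4)

# The mapping of blocks to nodes is the kind of metadata a NameNode manages.
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print([len(block) for block in blocks])
print(placement)
```

Because every block lives on several machines, the loss of one DataNode does not lose data, and different blocks of the same file can be processed on different machines at the same time.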


MapReduce 


MapReduce is considered to be a popular data processing engine since it is easy to use, is 
powerful, and enables the automatic parallelisation of computations on large clusters of 
commodity servers (White 2015). MapReduce was originally developed by Google for the 
generation of data for its production web services, for sorting and machine learning, and 
to scale over large clusters of machines (Dean & Ghemawat 2008). At present, however, 
MapReduce is employed as a programming model that enables the implementation of 
applications associated with processing and generating large datasets. This is mainly possible 
because the programming model uses the distributed file system to store and process the data 
and consists of machine code for processing and generating large datasets (Sharda, Delen 
& Turban 2014:587). This allows programmers with limited experience in parallel and 
distributed systems to develop applications that are automatically parallel and distributed 
in nature. These applications may be developed using either C, Java, Ruby or Python¹⁰⁹ 
(Watson 2014; White 2015). A typical MapReduce application consists of two parts, a Map 
phase and a Reduce phase. The Map phase converts raw data into key-value pairs, while the 
Reduce phase processes the data in parallel using a cluster (Landset, Khoshgoftaar, Richter 
& Hasanin 2015). The end product of these phases is a file that may either be loaded into 
a data warehouse or analysed through the use of big data analytic tools such as Tableau¹¹⁰ or 
Gephi.¹¹¹ Tableau is a commercial data visualisation tool often used by business intelligence 


109 C, Java, Ruby, and Python are programming languages which allow researchers to use a set of 
instructions to produce various kinds of outputs. These instructions will generally receive input 
and implement a specific algorithm. 


110 https://www.tableau.com/. 
professionals to build interactive and shareable dashboards, scoreboards, and reports. Gephi, 
on the other hand, is an open-source and free data visualisation tool suited for research such 
as social network analysis. 
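
The Map and Reduce phases described above can be illustrated with the classic word-count example. The sketch below runs on a single machine in plain Python and does not use Hadoop; the function names are chosen for the example. On a cluster, the map and reduce calls would be distributed across many servers automatically.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn raw text into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key, here by summing the counts."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big files"]))
```

Each call to `map_phase` depends on one document only, and each call to `reduce_phase` on one key only. That independence is what allows the framework to parallelise the work without the programmer writing any distributed code.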


YARN 


YARN, which is Hadoop’s cluster resource management system, has been used since 
Hadoop 2.0 to improve MapReduce implementation performance (White 2015). Prior to 
the introduction of YARN, Hadoop and MapReduce were tightly coupled and MapReduce 
was responsible for both cluster resource management as well as data processing. In YARN 
(regarded as MapReduce 2), MapReduce handles only data processing, while YARN is now 
responsible for cluster resource management. This division of tasks means that the new 
ecosystem does not only scale better, but can also accommodate more nodes, thus improving 
on the original Hadoop 1.0 ecosystem. 


8.8 The Hadoop software stack 


Choosing the right technologies and tools, such as the right solution stack, is an 
important part of the architectural challenges we try to solve. — Tom Smith (2016:1) 


As mentioned earlier, the Apache Hadoop software library consists of several projects, 
tools, and technologies often collectively referred to as the Hadoop ecosystem illustrated 
in Figure 8.1 below. To simplify this vast web of projects, tools, and technologies, the 
structure of an ecosystem is described in this section in terms of its storage, processing, and 
management layers. 


[Figure 8.1 shows the following components: MapReduce (processing model); YARN (cluster resource system); HDFS (distributed file system); Flume (distributed data service); Sqoop (data import and export).] 

Figure 8.1: The Hadoop ecosystem (adapted from Hu, Wen, Chau & Li 2014:678) 
The storage layer is the lowest level of the software stack and includes the HDFS 
(described previously) and HBase, a NoSQL database. Both the HDFS and HBase are 
responsible for data storage. It is important to re-iterate that HDFS is not a database, but a 
file storage system designed for scalability and fault tolerance. HBase, a column-oriented 
non-relational distributed database, runs on top of the HDFS and uses its scalability to provide a 
distributed storage engine. Records are stored in tables and columns and, as in an RDBMS, 
the tables must have a primary key (PK) to retrieve records when queried. The difference is 
that many attributes are stored as column families (Landset et al. 2015). Since HBase is a 
NoSQL database, it does not support SQL and when any SQL-like command is executed, 
the command must be translated into a Java-like equivalent (Rahman & Iverson 2015). 

The processing layer is where the actual processing and analysis take place, and its 
foundation is MapReduce and YARN. In addition to these processing frameworks, this layer 
includes tools for data acquisition such as Flume¹¹² and Sqoop¹¹³ (Basha, Kumar & Babu 
2016:126), which were developed to assist with data movement and integration. Flume is a 
distributed service that collects and aggregates large amounts of data from multiple sources 
and then loads it to a centralised data store or HDFS. Sqoop, on the other hand, handles the 
import and export of data between relational databases and Hadoop. For example, Sqoop can 
send data from a MySQL or Oracle database¹¹⁴ to HDFS, perform a MapReduce task, and 
send the HDFS MapReduce results back as an export to a relational database (Intel 2013; 
Celko 2014). Sqoop therefore plays an important role in importing data from a relational 
database to Hadoop and vice versa. It also enables researchers to consolidate structured data 
(from different sources) with unstructured data (from NoSQL sources) on a single Hadoop 
storage system. Other tools include a query engine such as Hive (Story & Song 2017) and 
a scripting language such as Pig¹¹⁵ (Gates & Dai 2016). Hive was developed by Facebook 
to bring the concepts of tables, columns and SQL (from the relational database world) to 
the Hadoop ecosystem. Hive also allows users to organise and partition big data sets into 
tables and provides HiveQL, an extension of ANSI SQL to write queries. However, one of 
the drawbacks of MapReduce is that algorithms and code developed in Java, Python, or C, 
for example, can become very complex, particularly for users not familiar with MapReduce 
programming. Pig Latin was developed by Yahoo! as a high-level declarative language to 
offer abstractions and hide the complexities associated with programming MapReduce jobs. 
Pig supports user-defined functions written in Python, Java, and JavaScript, and translates 
Pig Latin scripts internally into MapReduce tasks, without the programmer having to manage 

112 http://flume.apache.org/. 

113 http://sqoop.apache.org/. 

114 MySQL and Oracle are examples of relational database management systems (RDBMS). 
115 http://pig.apache.org/. 
the conversion. Pig is therefore ideal for programmers who do not enjoy programming in a 
higher-level language and are accustomed to developing scripts. In a sense, both Hive and Pig 
are “SQL-like” languages providing data warehousing capabilities to the Hadoop ecosystem. 

The management layer includes tools for high-level organisation such as scheduling, 
monitoring, and co-ordination. Examples of these tools include ZooKeeper and Chukwa 
which are utilised to monitor and manage distributed applications performed on Hadoop (Hu 
et al. 2014). ZooKeeper was originally developed by Yahoo! to make it easier for applications 
to access configuration information, but has grown to such a level that it can now co-ordinate 
and synchronise applications across distributed computer clusters (Warden 2011:10). 
Chukwa is a Hadoop sub-project that serves as a data collection system to monitor and 
manage large-scale systems (Krishnan 2013). Chukwa is built on HDFS and the MapReduce 
programming model, and has flexible and powerful tools for displaying, monitoring and 
analysing collected data (Chen et al. 2014). 

Even though most big data systems rely on Hadoop, there are scenarios that require 
real-time data streaming and processing instead of batch processing. Two stream processing 
systems include Spark¹¹⁶ and Storm¹¹⁷ (Shoro & Soomro 2015), which make it possible to run 
real-time, distributed computations on streams of data, and to store and emit the results to Hadoop 
(White 2015). Spark was originally developed by the University of California, Berkeley on 
the MapReduce framework (Zaharia et al. 2010), but is now an Apache project. Spark is 
seen as a new generation distributed processing engine and is faster and more flexible than 
MapReduce (Landset et al. 2015). In conjunction with its batch processing option, Spark 
also offers micro-batching using Spark Streaming. This approach partitions an incoming 
stream into chunks of data, which can then at a later stage be batch processed (Shahrivari 
& Jalili 2014). This is, however, not true real-time streaming, but does allow load balancing 
and also offers integration of stream and batch processing for an online application such as 
clickstream analysis (Zaharia et al. 2010). A typical Apache Spark ecosystem will consist of 
Spark SQL for structured data, GraphX for graph processing, MLlib for machine learning 
and Spark Streaming for micro-batching incoming data (Databricks 2016).¹¹⁸ Whereas Spark 
does not entail true real-time streaming, Storm was developed as an open-source distributed 
stream processing engine with the purpose of processing data in real-time. This option allows 
researchers to eliminate sources of latency, as well as deploy real-time analytics and online 
machine learning. In other words, transactions can be processed in seconds and the data can be 
transferred to a big data system from where real-time predictions can be made using machine 
learning. Storm is already successfully used by companies such as Groupon, Alibaba, and 
Twitter Analytics. 
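
The micro-batching approach described above can be sketched with a Python generator that cuts an incoming stream into small chunks and hands each chunk to ordinary batch code. This illustration does not use Spark. For simplicity it forms batches by counting records, which is an assumption of the sketch; a streaming engine would typically form them by time interval instead.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Partition an incoming stream into consecutive chunks of records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A stand-in for an unbounded event stream, e.g. click counts per second.
events = iter(range(1, 8))

# Each chunk is processed with plain batch logic (here, a sum).
totals = [sum(batch) for batch in micro_batches(events, 3)]
print(totals)
```

Results become available one chunk at a time rather than one record at a time, which is why the text above distinguishes this from true real-time streaming.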


116 https://spark.apache.org/. 
117 https://storm.apache.org/. 
118 https://databricks.com/spark/about. 
8.9 An example of a Hadoop big data system 


... our traditional systems are not capable enough [to perform] the analytics on 
... data which is constantly in motion. — Avita Katal, Mohammad Wazid and 
Rayan Goudar (2013:404) 


Figure 8.2 below provides a good example of what a Hadoop big data system looks like. 
Specifically, it illustrates a myriad of both structured and unstructured big data sources 
processed by ETL tools and then stored in HDFS. On the right-hand side of the illustration, 
batch processing via Hadoop and/or Hive is shown. The output is then used for analytics, 
business intelligence, and data visualisation using tools such as Excel or Tableau. The left-hand 
side of the visual representation illustrates how streaming data is processed in real-time 
by Storm and stored in real-time structured databases such as HBase and Cassandra. This 
processed data can be transferred to Hadoop and/or Hive using Sqoop and used by analytics 
tools such as Business Objects and MicroStrategy. 


[Figure 8.2 labels: Data visualisation (Excel, Tableau); Real-time structured database (HBase, MongoDB, Cassandra); Batch processing (Hadoop, Hive); Structured and unstructured data (HDFS)] 

Figure 8.2: A big data system (adapted from Ibarra 2012:3) 
8.10 Commercial big data systems and cloud big data 


Building scalable, assured big data systems is expensive. — Ian Gorton (2014:6) 


Not all big data systems are available as open source software platforms such as Apache 
Hadoop or Apache Spark. Several of the large traditional database vendors also offer their 
own big data systems, usually packaged with their proprietary RDBMS software, such as 
Oracle Big Data¹¹⁹ and IBM DB2 for Big Data.¹²⁰ Another commercial data platform is 
SAP HANA,¹²¹ an in-memory, column-oriented RDBMS for real-time analytics. However, 
commercial big data systems are not limited to the database or enterprise resource planning 
vendors. Quite a number of companies have created commercial big data systems that 
employ their own versions of Hadoop, recognising that one of the burning issues with 
regard to an open source platform is that despite being free for anyone to use or modify, 
it still requires specialist knowledge to set it up. Exploiting this need, several commercial 
versions have appeared on the marketplace with vendors creating their own versions which 
are easier to use. Some of the commercial versions include Cloudera Hadoop, Hortonworks 
Hadoop, EMC Hadoop, Microsoft Hadoop, Intel Hadoop, and MapR (Davenport 2014). 
It is interesting to note that some of the most important applications that make use of 
commercial big data systems include targeted marketing, social media analytics, and website 
recommendation engines (Loshin 2013). 

Purchasing and setting up big data infrastructure can be very expensive, but 
fortunately, alternative solutions are available and such infrastructure can be set up using 
cloud computing services. Cloud computing services are essentially a new style of delivering 
applications, data, and resources over the Internet: they allow researchers to rely on 
a third party (the cloud provider) that uses a number of interconnected computers to provide 
a data service. These data services are currently offered through platforms such as Amazon EC2¹²² 
and Microsoft Azure,¹²³ whose providers charge a user a fee based on storage space and processing time 
(Gunarathne, Wu, Qiu & Fox 2010). Alternatively, cloud computing infrastructure could 
be used for the infrastructure layer (instead of a pool of ICT resources) and enabled by 
virtualisation technology (Hu et al. 2014), which refers to the ability to create a virtual rather 
than actual version of technology, such as virtual computer hardware platforms.¹²⁴ 


119 http://www.oracle.com/. 

120 http://www.ibm.com/. 

121 https://www.sap.com/products/hana.html. 
122 http://aws.amazon.com/ec2/. 

123 http://azure.microsoft.com/. 


124 Virtual machines (VM) are a well-known example of hardware virtualisation. 
8.11 A reminder 


The mythic power of big data is part of what unifies it as a concept and informs its 
legibility as a set of tools. — Kate Crawford, Kate Miltner and Mary Gray (2014:1664) 


We would not want to create the impression that the above quotation constitutes a rejection 
of the big data ecosystem. Throughout this book, our message has been clear: it is a call to 
consider using big data in the social sciences and humanities, given the huge amounts of data 
available, while at the same time interrogating big data’s methods and assumptions, with a 
view to improving the ways in which research is conducted (cf. Crawford et al. 2014:1665). 
In Chapters 2 and 5, we considered how big data is framed and noted that the era of big data 
has “[not] been precipitated by technology alone” (Crawford et al. 2014:1664), although 
some might argue that the advent of Apache Hadoop “[reinforces] the idea of big data’s 
newness” (Crawford et al. 2014:1663). 

What we cannot refute is that current trends in technology such as wearable 
computers, the Internet of Things, and mobile sensors are all contributing to mountains 
of “big” heterogeneous data. Turning this data into useful information is vital in our 
quest to obtain actionable knowledge that can be used to improve the world around us 
— that is, knowledge that will not only aid us in decision making, but also help us analyse 
complex problems (cf. Cao 2015:288). Throughout history, we have continued to overcome 
many hurdles in fulfilling this quest. Now in the year 2019, we are facing a new obstacle, 
which is essentially too much data coupled with too little wisdom. Big data technologies 
offer us the ability to improve our collective learning, and provide an architecture 
that facilitates data collection, storage, processing, and analysis of massive amounts of 
information. We are also of the opinion that cloud computing will become 
an increasingly important information technology platform (as an alternative to expensive 
computer clusters). With the promise of faster networks, as can be seen in the design of 5G 
wireless networks, for example, there is a real possibility of big data becoming mobile. New 
business models such as Big Data as a Service (BDaaS), or Big Data Analysis as a Service 
(BDAaaS) are exciting possibilities for the social scientist and humanist who would like to 
embrace big data and technology to promote scientific progress. Some of these possibilities 
pertain to exploring crowdsourcing (O’Leary 2014; Mulder, Ferguson, Groenewegen, 
Boersma & Wolbers 2016), big data social networks (Scott 2017; De Nooy, Mrvar & 
Batagelj 2018), and “smart-cities” (Li, Cao & Yao 2015; Giest 2017), all of which provide 
useful insights into human activities that could potentially foster more “self-aware” societies. 
Leveraging social scientific and humanistic 
expertise in the world of (big) data science 


Humanists have distinct abilities to examine data from multiple angles while 
interpreting trends and outliers. — James Shulman (2017:1) 


Integration of social science into research is crucial. — Ana Viseu (2015:292) 


With the vast amounts of data available these days, companies and academic institutions alike are 
striving to find new ways to exploit it to gain a competitive advantage over their rivals, but this 
requires deep insights which cannot be generated by machines, but only by humans with rich 
analytical skills. Previously, institutions would make use of statisticians or modellers to analyse 
and explore their datasets and this was often done employing manual methods. As computers 
have become increasingly powerful, more data has been produced and at the same time, 
more powerful algorithms¹²⁵ have been developed to connect new datasets and enable deeper 
exploration. This has given rise to a fairly new practice called “data science”,¹²⁶ which involves the 
extraction of information and insights from complex data in order to detect meaningful patterns 
(Wang et al. 2018:8). Of course, this is a rather simplistic definition which leaves humanists, 
social scientists, and even data scientists a little confounded. 

Before we take a look at the confusion surrounding the term data science, we would like 
to point out that we have opted to use it as an umbrella term for data analytics, machine learning, 
and data mining, to name a few. We argue that traditional humanists and social scientists 
who choose to work with big data may see themselves as conducting data science and/or data 
analytics. Those who regard themselves as carrying out data science research will be interested 
in all aspects of data science, from data cleansing and preparation to data analysis, employing 
statistics, programming, mathematics, and the like (cf. Baškarada & Koronios 2017; Kelleher & 
Tierney 2018). Researchers who are interested in data analytics, by contrast, will be concerned 
with generating meaningful insights from their datasets (cf. Concessao 2017); the majority of 
humanists and social scientists tend to fall into the latter category. 


125 Algorithms, which may be defined as sets of instructions for solving tasks, have been in existence 
for almost 4000 years. The first known algorithm was created around 2000 BCE in Mesopotamia. 
Classical algorithms include Sumerian-Babylonian root extraction (circa 1700 BCE), the Euclidean 
algorithm (fourth century BCE), and the approximation of the circle number π (third century BCE) 
(Brudener 2018). 


126 As a scientific term, “data science” came into being approximately 17 years ago (Cao 2016:1). 
9.1 What is data science? 


What’s in a name? — Peter Lake and Robert Drake (2014:104) 


In order to understand what data science entails, it is useful to deconstruct the term as it 
is open to different interpretations (cf. Lake & Drake 2014; Zhu & Xiong 2015) owing 
mainly to the fact that it is still regarded as an emerging discipline (Rose 2016:3; Donoho 
2017:745). According to the Oxford English Dictionary (2019), “data is facts and statistics 
collected together for reference or analysis”, and its origin can be traced back to the mid- 
seventeenth century when data was used as the plural of datum, which literally refers to “a 
piece of information”. Science on the other hand is defined as “the intellectual and practical 
activity encompassing the systematic study of the structure and behaviour of the physical 
and natural world through observation and experiment” (Oxford English Dictionary 2019). 
We could thus argue that data science refers, at least to some extent, to the ability to analyse 
and interpret distinct pieces of information (i.e., data) about the natural and physical world 
employing a systematic approach that uses both observation and experimentation. However, this 
definition does not do justice to the diverse viewpoints adopted by scholars who variously 
argue (1) that data science is the study of data produced and employed in scientific studies 
(Dhar 2013), (2) that it involves the study of business data (Provost & Fawcett 2013:56), 
(3) often with a view to solving scientific and business problems (Svolba 2017), and (4) 
that it reflects the integration of computing technology, statistics, and artificial intelligence 
(Dhar 2013) (Zhu & Xiong 2015:2-3). 

The term data scientist is equally difficult to delineate¹²⁷ because academic research 
about data scientists and what they do is virtually non-existent (Choi & Tausczik 2017:2). 
In an earlier chapter, we found it insightful to explore the various metaphors used in the 
media and in academic environments to define big data with a view to understanding 
how this phenomenon is framed, and we repeat this exercise with the term data scientist. 
Usefully, research professor in open and distance learning at the University of South Africa, 
Paul Prinsloo (2016:348), has summarised a few of the fascinating metaphors used to 
describe data scientists. In popular media, data scientists are variously labelled as “gods” 
(Bloor 2012), “high priests” (Dwoskin 2014), “game changers” (Chatfield, Shlemoon, 
Redublado & Rahman 2014), and even “rock stars” (Sadkowsky 2014). In the current 
literature on data science and big data, academic scholars are also increasingly referring 
to data scientists as “unicorns” (Anderson 2015; Harris & Eitel-Porter 2015; Baškarada 
& Koronios 2017) — individuals who are steeped in the whole range of skills required to 


127 While the Oxford English Dictionary makes no reference to data science, it does define a data scientist 
as “a person employed to analyse and interpret complex digital data ...”. 
carry out (big) data science research.¹²⁸ Significantly, these scholars are not calling for data 
scientists to become unicorns, however. Instead, the metaphor is being used to show how 
such scientists tend to be romanticised in the literature: “[a] perfect data scientist is often 
described as a ‘unicorn’ because it is impossible for an individual to have all the skills needed. 
Renowned data scientists have urged their field to make use of more teams because it is so 
difficult for any individual to gain a complete skillset” (Choi & Tausczik 2017:2). The myth 
of the data scientist as unicorn is unfortunately driven by industry and position articles 
in popular media. Articles with titles such as ‘How to become a unicorn data scientist’,¹²⁹ 
‘The hunt for unicorn data scientists lifts salaries’ (Press 2015:1), and ‘What’s the secret 
sauce to transforming into a unicorn in data science?’ (Kesari 2018:1) are becoming quite 
prevalent on the Internet. To be fair, there are just as many articles online that are challenging 
the existence of such a mythical creature, and in the academic environment, researchers are 
also questioning the feasibility of being able to train data scientists who are au fait with all 
areas of data science (van der Aalst 2016; Grant 2017; Ohri 2017; Stadelmann, Stockinger, 
Bürki & Braschler 2018). 

While it is not problematic to define what specific scientists such as political, social 
or climate scientists do given that they are the products of easily identifiable disciplines, 
this is not the case when it comes to data scientists, since they come from a wide variety 
of different fields such as mathematics, statistics, data analysis, and information engineering 
(Rose 2016:3). Doug Rose (2016:4) contends that individuals from these fields might appear 
to be “a better fit for the title ‘data scientist’ than others”, yet “[i]f [researchers have] worked 
with numbers and [know] a little about data, [they] could call themselves data scientist[s]”. 
Irrespective of where they come from, what all data scientists have in common is “[their] 
focus on the science and not the data” (Rose 2016:4). Interestingly, a common misconception 
persists that data scientists work only with large datasets and that the terms data science and 
big data are thus interchangeable (Jagadish 2015:51). Believing that data science equates to 
big data is a common error in light of the fact that the two are often discussed in the same 
breath (cf. Rose 2016:6).¹³⁰ 

Given that there are numerous views when it comes to what data scientists are, not 
all scholars are in agreement as to what their skills and responsibilities should encompass. 
In 2014, Vincent Granville distinguished between the “horizontal data scientist” and the 
“vertical data scientist” in his book Developing analytic talent. The former type of data scientist 


128 The term “unicorn” appears to have been coined in 2013 by Aileen Lee, founder of the venture capital 
company Cowboy Ventures (Ohri 2017:3). 


129 https://whatsthebigdata.com/2015/10/17/how-to-become-a-unicorn-data-scientist-and-make- 
more-than-240000/. 


130 As Jagadish (2015:51) observes, using the two terms as though they were synonymous “is not 
completely inappropriate: [however] the primary difference between the two terms is their perspective: 
‘Big Data’ begins with the data characteristics ... whereas ‘Data Science’ begins with data use ...”. 
is someone who possesses deep technical knowledge in a narrow field. Thus, for example, 
computer scientists may be regarded as horizontal data scientists because they are familiar with 
programming languages, algorithms, data structures, software development methodologies, 
and the like. In addition, these scientists have cross-disciplinary knowledge (Granville 
2014:76) because they are able to blend a number of fields such as computer science, statistics, 
and machine learning in addition to possessing domain expertise. By contrast, vertical data 
scientists are individuals with a narrow set of technical skills coupled with very little domain 
expertise. Such individuals are, according to Granville (2014:78), “fake data scientist[s]”. 
What is rather interesting is that in Granville’s (2014:5) view, a researcher analysing 10 000 
rows of data (as opposed to 10 million, for example) is conducting “fake data science” because 
the amount of data does not conform to all the Vs identified in the big data arena. However, as 
we have noted numerous times, big data is relative; as Jagadish (2015:50) so aptly puts it, size 
is “a moving target” because what counts as big is forever shifting: “What we consider big now 
is not the same as what we considered big five years ago or what we will consider big five years 
from now” (Marsden, Shirai & Wilkinson 2018:33). What makes datasets regarded as small 
by some big data scientists “big” in the humanities and social sciences is that scholars in these 
disciplines are discovering that they can no longer use traditional methods to analyse them. 
More importantly, complex data rather than big data is the focus for these scholars, since they 
are often compelled to work with data from multiple sources (cf. Marsden et al. 2018:35). 
According to well-known computer and data scientist Wil van der Aalst (2014), the 
main rationale behind data science is to provide answers to a number of generic questions such 
as “What happened?”, “Why did it happen?”, and “What will happen?” using big data as the 
new oil. Based on the literature, the essential skills a data scientist should acquire to answer 
these questions are those in machine learning, data visualisation, advanced data management, 
(data) storytelling,¹³¹ and problem-formulation/problem-solving skills, while a background 
in mathematics, statistics and/or computer science accompanied by domain expertise is 
also important (Patil 2011; Davenport & Patil 2012; Dhar 2013; Harris, Shetterley, Alter & 
Schnell 2013; Anderson, Bowring, McCauley, Pothering & Starr 2014; Holtz 2014; Mills, 
Chudoba & Olsen 2016). The typical skillset and areas of knowledge of a data scientist are 
depicted in the form of a mind map in Figure 9.1, while the mind map in Figure 9.2 highlights 
many of the tasks required of a data scientist. It should be noted that our focus is on the skills 
and knowledge of (big) data scientists, rather than on those of big data developers, engineers 
or business analysts (De Mauro, Greco, Grimaldi & Nobili 2016). 


131 In the era of big data, “data explorers have become the narrators of the stories that data is trying to tell. 
‘The power of narrative in scientific data lies behind transforming information into knowledge that 
provides better understanding of complex matters. This enables scientific data explorers to achieve 
the goal of creating engagement and raising awareness of the message that is being communicated” 
(Arboleda & Dewan 2017:38; cf. Yoder-Wise & Kowalski 2003:37). Data visualisation is one 
component of storytelling. 
Figure 9.1: The skills and knowledge of a data scientist 

Figure 9.2: The various tasks performed by a data scientist 

Ben Stenhaug, a “data science for social good” fellow at Stanford University, is 
mindful that it is challenging for anyone who has no background in (big) data science to 
learn about the skills and tasks summarised in Figures 9.1 and 9.2 in any formal way. Indeed, 
Stenhaug (2017:3) points out that many graduate students at Stanford “[are] struggling to 
cobble together their own data science education”. The situation is not very different in 
South Africa since data science is still a practice rather than an established discipline (cf. 
Nongxa 2017:1). Both in South Africa and abroad, data science degrees are not geared 
towards helping non-data scientists easily transition into (big) data science, while many 
resources around data science are simply “too deep [and] too fast” (Stenhaug 2017:3) for the 
novice data scientist. This does not, however, mean the end of the road for anyone interested 
in carrying out data science research. Statistician and data miner Meta Brown (2017:2-3) 
argues that based on her own teaching experiences, “bare minimum requirements” are needed 
to begin conducting (big) data science research, and these requirements include a knowledge 
of general mathematics (linear algebra), probability and statistics, familiarity with at least 
one computer language (such as Python, R or SQL), competence in applications such as 
spreadsheets and word processing, and good communication skills. Brown (2017:3) goes so 
far as to add that when hotly challenged about this advice by an individual who contended 
that a Master’s degree in data science is essential, she simply replied, “Swear all you like, 
brother, I know what I’m talking about”. Academics at UC Berkeley’s Division of Data 
Sciences would certainly agree with Brown: they offer a course in analysing cultural data 
through a combination of humanities and machine learning approaches, and students do not 
need a background in data science or digital humanities to complete the course.¹³² The same 
division also offers a number of other courses such as ‘Foundations of Data Science’, which 
has no data science, computer science or statistics prerequisites.¹³³ Stanford University 
offers a ‘Data Science Minor for the Humanities and Statistics’ course that does not require 
any programming or statistical background and that is aimed at helping students develop 
data-analytic methods that are directly related to their fields of interest.¹³⁴ 

A number of scholars suggest that humanities and social science graduates consider 
a career as a data scientist (cf. Antonijević 2015; Salmon 2017). A good example of an 
individual who has combined a humanities degree with data science is a computational 
linguist who makes use of both linguistics and computer science (Jurafsky & Martin 2009) 
as we noted in Chapter 6. In linguistics, a scholar’s focus may fall on sociolinguistics, 
dialectology or corpus linguistics, to name a few branches of this discipline. In recent times, 
these three areas have begun harnessing advanced quantifiable methods to analyse big 


132 http://data.berkeley.edu/making-sense-cultural-data. 
133 https://data.berkeley.edu/education/courses/data-8. 
134 https://mcs.stanford.edu/news/new-data-science-minor-humanities-statistics-course. 
datasets, methods which are often associated with natural language processing (NLP), 
a sub-field of computer science. It is well known that text and web mining of social media 
data are popular data science approaches, often leading to new insights about language use 
and variation (Liu 2011, 2012; Liu & Zhang 2012), for example. 


9.2 Marrying (big) data science and the humanities 
and social sciences 


...creating opportunities for bringing social scientific and humanistic expertise 
into data science practice simultaneously will advance both data science and 
critical data studies. — Gina Neff, Anissa Tanweer, Brittany Fiore-Gartland & 
Laura Osburn (2017:85) 


A number of researchers have pointed out that it is not sufficient for data scientists to have 
skills and knowledge in the areas just summarised, but that a more holistic view of data 
science is called for which acknowledges the human perspective as well. In this respect, Blei 
and Smyth (2017:8690) argue that since data science cannot be fully automated, “significant 
human judgment and deep disciplinary knowledge” are needed when scientific questions are 
posed. This is echoed by data scientist and machine learning researcher Matthew Mayo 
(2018:2): 


Human involvement, for the foreseeable future, is paramount, not only for 
overseeing and correcting [the] course for any level of automation, but also 
to kick off searches for insight. We may be able to automate exploratory 
investigations of what questions we should be looking to potentially apply 
the data science process to in the hopes of answering, and even have this 
phase augmented by facts and figures, but the human element will need to 
make nuanced decisions on which courses of action are worthy of pursuit. 


The so-called human angle cannot be ignored as this is exactly what is required to 
“understand the context of data, appreciate the responsibilities involved in using private 
and public data, and clearly communicate what a dataset can and cannot tell us about the 
world” (Blei & Smyth 2017:8691). Humanists and social scientists are uniquely positioned 
to meet these requirements and to resist the notion that (big) data is a monolith (cf. Bailey 
2016:169). 

In this respect and in the context of so-called (big) “data for good projects” (Neff, 
Tanweer, Fiore-Gartland & Osburn 2017:85), there is an urgent call to improve research 
practices in both data science and critical data studies (CDS).¹³⁵ The aim of scholars such 


135 Critical data studies (CDS) are studies that interrogate the challenges of big data that pertain to 
culture, politics, and ethics (Dalton, Taylor & Thatcher 2016). These studies “[question] the many 
as boyd and Crawford (2012), Provost and Fawcett (2013), Iliadis and Russo (2016), and 
Dalton et al. (2016), to name a few, is not to label data scientists as being either reflexive 
or unreflexive about their research, but to identify best practices that would serve these 
scientists as well as CDS researchers. In their ethnographic research on the culture and 
practices of data scientists, communication scholars Neff et al. (2017:85-86) have observed 
that many data scientists acknowledge the political, cultural, and ethical challenges inherent 
in producing and analysing data, although they tend not to be as explicit about these 
challenges as humanists and social scientists are. They therefore suggest that humanities and 
social science scholars be positioned in research teams so that “[they] can help data scientists 
understand the layers of social, organizational, political, ethical, and emotional complexities 
embedded in their work” (Neff et al. 2017:95).¹³⁶ They describe a series of conversations they 
initiated to bring together social scientists, librarians, science and technology study scholars, 
and data scientists to talk about aspects of their research such as privacy, transparency, 
and the democratisation of data science. What they discovered was that the data scientists 
appreciated the initiative not only because it fostered opportunities for collaborative sense 
making, but also because it improved their own critiques of data science. With respect to 
these critiques and in Neff et al.’s (2017:85) view, (big) data science studies benefit from the 
acknowledgement that (1) communication should not be marginalised at any stage in the 
data science process; (2) sense making is a collaborative effort; (3) data is a starting point 
and not an end in itself; and (4) data originates from and is exchanged via sets of stories.¹³⁷ 

In the world of big data, there is also an urgent call for data scientists to adhere to a 
number of fundamental principles that allow them to extract information and knowledge 
from complex datasets in principled ways (Provost & Fawcett 2013:56). These principles 
include but are not limited to (i) extracting useful information from datasets with a view 
to solving complex problems, (ii) being sensitive to the contexts in which their data-science 
results will be employed, (iii) using information technology to extract informative data items 
from a much larger body of data, (iv) paying careful attention to so-called “confounding” or 
even unseen factors before drawing causal conclusions, and (v) detecting and thus avoiding 
“overfitting” a dataset — that is, finding effects in a specific sample that cannot be generalised 
to a larger population (Provost & Fawcett 2013:56-57). 
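
Principle (v) can be illustrated with a deliberately tiny, invented dataset. In the sketch below, which is an illustration under stated assumptions rather than a method taken from Provost and Fawcett, the "true" rule is that the label is 1 when x is at least 10, two training labels are flipped on purpose to act as noise, and the names `memoriser` and `fit_threshold` are hypothetical. The memoriser reproduces every training label, including the noise, while the simpler cut-off model ignores it.

```python
# Toy illustration of overfitting; all data below is invented for the example.
# Generating rule: the label is 1 when x >= 10, otherwise 0.
# Two training labels (at x=4 and x=14) are flipped on purpose to act as noise.
train = [(0, 0), (2, 0), (4, 1), (6, 0), (8, 0),
         (10, 1), (12, 1), (14, 0), (16, 1), (18, 1)]

# A fresh, noise-free sample drawn from the same rule.
test = [(x + 0.5, int(x + 0.5 >= 10)) for x, _ in train]


def accuracy(model, sample):
    """Share of points in `sample` that `model` labels correctly."""
    return sum(model(x) == label for x, label in sample) / len(sample)


def memoriser(x):
    """Overfitted model: copy the label of the nearest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def fit_threshold(sample):
    """Simple model: pick the cut-off that labels most training points correctly."""
    candidates = [x for x, _ in sample]
    return max(candidates,
               key=lambda t: accuracy(lambda x: int(x >= t), sample))


cutoff = fit_threshold(train)


def threshold_model(x):
    return int(x >= cutoff)


for name, model in (("memoriser", memoriser), ("threshold", threshold_model)):
    print(name, accuracy(model, train), accuracy(model, test))
```

On this data the memoriser scores 1.0 on the training sample but only 0.8 on the fresh sample, whereas the threshold model scores 0.8 on the training sample and 1.0 on the fresh one. The effect the memoriser "found" at x=4 and x=14 exists only in the sample it was fitted to.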


assumptions about Big Data that permeate contemporary literature on information and society 
by locating instances where Big Data may be naively taken to denote objective and transparent 
informational entities” (Iliadis & Russo 2016:1). 


136 Kitchin & Lauriault (2014:1) refer to these crucial aspects of data science as the sociotechnical “data 
assemblages” of big data. 

137 Neff et al. (2017:94) refer to the field of engineering to illustrate what they mean by sets of stories: 
“the stories that data can tell begin with the stories that shape the production of data or the stories 
that help make sense of the potential desired outcome and need for data”. 
An example: Big data analysis in the 
humanities in South Africa 


... [Wh]at questions are we asking of our big data sets, and what data are we 
using? The answers are important, and point to the need for a humanist’s touch in 
big data projects. — Alex Woodie (2015:1) 


10.1 Introduction 


This chapter uses big data methods discussed throughout this book to illustrate the value 
of using big data to answer questions posed in the humanities. We follow the data analytics 
project life cycle, which reflects the following stages: 

1. Identifying the problem 
2. Gathering the data 
3. Pre-processing the data 
4. Performing analytics on the data 
5. Visualising the results 


The following sections provide an illustrative example of a typical data analytics 
project life cycle using sentiment analysis. 
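
Before each stage is discussed, the five stages listed above can be sketched end to end in a few lines of Python. This is a minimal sketch, not the analysis carried out in this chapter: the three posts are invented, the miniature word lists `POSITIVE` and `NEGATIVE` are hypothetical stand-ins for a validated sentiment lexicon, and the simple word-counting approach is an assumption chosen for brevity.

```python
import re
from collections import Counter

# 1. Identify the problem: what share of posts is positive, negative or neutral?

# 2. Gather the data (three invented posts stand in for a real collection).
posts = [
    "The new library is wonderful and the staff are helpful!",
    "Terrible service, a bad experience overall.",
    "The meeting starts at ten.",
]

# Hypothetical miniature lexicon; real projects use far larger, validated ones.
POSITIVE = {"wonderful", "helpful", "good", "great"}
NEGATIVE = {"terrible", "bad", "poor", "awful"}


def preprocess(text):
    """3. Pre-process: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify(tokens):
    """4. Analyse: count lexicon hits and compare the two totals."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [classify(preprocess(post)) for post in posts]
counts = Counter(labels)

# 5. Visualise: a plain-text bar chart of the label counts.
for label in ("positive", "negative", "neutral"):
    print(f"{label:<9}{'#' * counts[label]}")
```

A lexicon count of this kind ignores negation, irony, and multilingual text, which is one reason the pre-processing and analytics stages deserve the careful, context-sensitive treatment that humanists and social scientists are well placed to provide.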


10.2 Identifying the problem 


The Afrikaner was placed in the spotlight in January 2018 when Hoërskool Overvaal in 
Gauteng was accused of racism after refusing to accept pupils who were not Afrikaans 
speaking (Mitchley 2018). Large-scale protests by the Economic Freedom Fighters (EFF) 
and the African National Congress (ANC) followed, but eventually the North Gauteng 
High Court turned down the appeal of Gauteng Education MEC¹³⁸, Panyaza Lesufi, to have 
55 English-speaking pupils admitted to the school (Masinga 2018). 

Another major event took place on 28 February of that year when Parliament 
undertook to re-consider Article 25 of the South African Constitution to allow for the 
expropriation of land without compensation. The discourse centred on white/black identities, 
the ‘haves’ versus the ‘have-nots’, the privileged as opposed to the underprivileged, the land 


138 Member of the executive council. 
thieves against the dispossessed, and the legacy of apartheid — the Afrikaner’s dominance of 
South African politics from 1948 to 1994 (Eloff 2017; Chung 2018; Mkokeli 2018; Osborne 
2018; Roelf, 2018). Julius Malema for instance remarked: “We must ensure that we restore 
the dignity of our people without compensating the criminals who stole our land” (Chung 
2018). On 4 December 2018, Parliament voted in favour of the decision to consider the 
possibility of amending the constitution. Eventually, ‘expropriation without compensation’ 
became South Africas catchphrase of the year (Grobler 2018; Sekhotho 2018). 

Coupled with the land issue, Afrikaners also made headlines through numerous 
incidents of racism and alleged racism. One of the episodes was the sentencing of Vicky 
Momberg for crimen injuria on 28 March 2018 following her racist tirade against black 
police officers (Fihlani 2018; Pijoos 2018; Ritchie 2018). After several other incidents, the 
year ended with a race row at Clifton Fourth Beach in Cape Town (Nombembe 2018; 
Chambers 2019). 

A further issue that put the Afrikaner in the spotlight was farm attacks. In May 2018, 
the Afrikaner civil rights group, AfriForum, visited the USA to obtain support for what it 
referred to as “white genocide” and a “fight against land expropriation without compensation” 
(Thamm 2018:1; cf. Du Toit 2018; Kriel 2018). Journalists such as Lauren Southern and 
Katie Hopkins made documentary films about farm attacks and the Afrikaner’s position in 
South Africa. Eventually, this media exposure led to a diplomatic incident between the ANC 
and Australia after Peter Dutton, Australia’s Home Affairs Minister, declared that white 
farmers should be welcomed in Australia (Gous 2018; Killalea 2018), and then between 
the ANC and the US (Steinhauser 2018) after US President Donald Trump tweeted: “I 
have asked Secretary of State @SecPompeo to closely study the South Africa land and farm 
seizures and expropriations and the large scale killing of farmers”. 

But how is the Afrikaner generally regarded in post-apartheid South Africa? This 
is an important question given that the term ‘Afrikaner’ is not a homogenous one (Visagie 
2018:5). We argue that social media analytics, and more specifically sentiment analysis, may 
offer some insights into public sentiment and opinion surrounding the Afrikaner. 


10.3 Data gathering 


In the literature on Afrikaner identity, the word ‘Afrikaner’ is highly contested. According to 
Verwey and Quayle (2012:555), “[many] prominent (and often dissident) Afrikaner writers 
have engaged with the question of whether Afrikaners are African”, for example. In this 
regard, they point to Breyten Breytenbach, who observed in 1983 that “he both belong[ed] 
and [did] not belong to Africa”¹³⁹ (Verwey & Quayle 2012:555) and to Max du Preez, who 
wrote in 2003 that he was both an African and an Afrikaner.¹⁴⁰ 

Bearing in mind that the word ‘Afrikaner’ is fraught with contradictions, we 
collected tweets from Twitter that mentioned the word ‘Afrikaner’ from 6 November 2018 
to 6 February 2019. Out of 21379 users, a total of 56565 tweets were amassed, including 
23270 unique tweets. These tweets were generated at an average of 734.6 tweets per day by 
an average of 277.9 users per day. 

The first challenge was to filter out irrelevant tweets. What is problematic is that 
the word ‘Afrikaner’ is used to denote different groups in different languages. In Afrikaans, 
English and the African languages, the term refers to “a South African person whose family 
was originally Dutch and whose first language is Afrikaans” (the Cambridge Dictionary 
2019). The South African trade union Solidarity (2018:5) defines the Afrikaner in clearer 
terms as “mense wat hulle ras as wit en hulle taal as Afrikaans aangee” [“people who define 
themselves as white and have Afrikaans as a first language”].¹⁴¹ Although not stated explicitly 
in racial terms, the Cambridge Dictionary’s reference to “a South African person whose family 
was originally Dutch” implies that this definition is similar to that used by Solidarity. We 
concede that it is difficult to find just one definition of the term. Steyn (2016:2) puts it 
eloquently when he says that it is a “slippery” act to use the term ‘Afrikaner’ to refer to all 
white people whose mother tongue is Afrikaans. Steyn (2016:2) asks, “What, for example, 
should the test be to determine whiteness? [...] some ‘Afrikaners’ choose to distance 
themselves from any identification with ‘other’ Afrikaners or with some form of ‘Afrikaner 
identity’, while other ‘Afrikaners’, in their everyday lives, speak little or no Afrikaans”. 

In German, Danish, Swedish, Norwegian, and Polish, on the other hand, the word 
‘Afrikaner’ is used to denote “someone of African descent”. To test dictionary definitions 
against colloquial use, the language of tweets was identified using Google Translation API, 
and then translated into English using the same API. A total of 77 languages were identified 
in this manner. Table 10.1 provides some sample tweets with an automatic English translation 
reflected in the right-hand column. What is also illustrated is that the dictionary definitions 
remain valid for German, Swedish, Norwegian, Danish, Polish, and Dutch, but that the 
word Afrikaner carries the same meaning in French as it does in English and Afrikaans. 


139 The true confessions of an Albino terrorist (Mariner Books). 
140 Pale native: Memories of a renegade reporter (Zebra Press). 


141 https://solidariteit.co.za/en/who-are-we/. 
Table 10.1: The use of the word ‘Afrikaner’ in some sample languages 


Language | Tweet | English translation
Swedish | Tre afrikaner häktade för gruppvåldtäkt på Södermalm - Nyheter Idag | Three Africans arrested for gang rape in Södermalm - News Today
Swedish | En vacker och intelligent vit 22-åring tjej med A i alla ämnen blir gruppvåldtagen och mördad av ett gäng svarta afrikaner [emoji] Är det detta ni vill? Är det värt det med massinvandring? Dela om du vill se en förändring! [emoji] #svpol #migpol #sd2019 #AfS2019 | A beautiful and intelligent white 22-year-old girl with A in all subjects are gang-raped and murdered by a gang of black afrikaner [emoji] Is that what you want? Is it worth it to mass immigration? Share if you want to see a change! [emoji] #svpol #migpol # sd2019 # AfS2019
Norwegian | Hvis du aldri har blitt forbanna på den innvandringspolitikken som føres i landet så har du heller aldri lest en personlig fortelling fra ei 15 år gammel jente om hvordan det føles å voldtas med kniv mot strupen av en afrikaner. | If you've never been pissed off at the immigration policies pursued in the country you have never read a personal story of a 15 years old girl about how it feels to be raped with a knife at the throat of an African.
German | Freiburg Hauptschule ... ein Afrikaner schlägt einem Deutschen Schüler das Pausenbrot aus der Hand .. und schlägt Ihm in das Gesicht .. der Deutsche Schüler sagt Zitat : scheiss Neger... und schlägt zurück ... Ergebnis ein Schulverweis ... er muss such eine neue Schule suchen | Freiburg main school ... an African beats a German student’s lunchbox from the hand and hits him in the face .... the German student says Quote: fucking nigger ... and fighting back ... Result an expulsion ... he must search to find a new school
German | Wenn Gott gewollt hatte,dass Europa ein Ort für Afrikaner sei,hatte er sie weiss gemacht.!!! | If God had willed that Europe is a place for Africans, he would have made them white.!!!
Dutch | Congolese(54) meegelift/carrière gemaakt bij linkse partij in Italië, stapt uit, richt “Afrikaanse partij op alleen voor Afrikanen” met 100%racistisch programma, links juicht | Congolese (54) hitched / career made by left-wing party in Italy, get off, dir “African Party for Africans only” 100% racist program, left cheers
Dutch | Zo begint de #terreur - en met officiele #toestemming! De kleinste Nederlandse kinderen gaan uitgescholden & belaagd worden op sinterklaasfeestjes. Hoe verschilt dat van belaagde #Afrikaner kleintjes op lagerescholen in #ZuidAfrika? - | Thus begins the #terreur - and with official #toestemming! The smallest Dutch children are abused and attacked to Saint Nicholas parties. How is that different from endangered #Afrikaner children in lower schools #ZuidAfrika? -
Polish | Erytrejczyk i dwaj Somalijczycy aresztowani za gwałt zbiorowy w centrum Sztokholmu | Eritrean and two Somalis arrested for gang rape in central Stockholm
Polish | Niemcy. Migrant z Afryki subsaharyjskiej przeciął twarz młodej kobiety nożem tylko dlatego, że nie miała papierosa! Kirchheim Teck 19-letnia kobieta w sobotni wieczór o godzinie 23:15 została poważnie oszpecona nożem. | Germany. Migrant from sub-Saharan Africa cut the face of a young woman with a knife because she did not have a cigarette! Kirchheim Teck 19-year-old woman on Saturday night at 23:15 was severely disfigured knife.
French | 1/7 des afrikaners vie dans des bidonvilles (500 000 ) La politique actuelle est anti blanche. Effectivement il y a un certain pouvoir blanc mais cela ne doit pas de résoudre comme cela . C’est pour ça que je prenne une zone de l'Afrique du sur pour les afrikaner et les coloured | 1/7 Afrikaners living in slums (500,000) Current policy is anti white. Sure there are some white power but this does not solve like that. That’s why I take an African area on for Afrikaner and colored
French | ce matin: Deuxième temps ce mardi de notre série sur l'Afrique du Sud. Aujourd’hui Victor Macé de Lépinay nous guide, en compagnie de ses deux invitées, dans le Voortrekker Monument de Pretoria, lieu symbolique de l’histoire afrikaner. | this morning Second time on Tuesday in our series on South Africa. Today Victor Mace Lépinay to guide us with his two guests in the Voortrekker Monument in Pretoria, symbolic place of Afrikaner history.


Table 10.2 shows the number of tweets collected by language. 


Table 10.2: Language distribution of tweets 


Language | Number of tweets | Percentage of tweets | Number of users | Percentage of users
German | 27917 | 49.22% | 9192 | 42.96%
English | 16684 | 29.42% | 9106 | 42.56%
Spanish | 3622 | 6.39% | 390 | 1.82%
Afrikaans | 3017 | 5.32% | 1083 | 5.06%
Swedish | 2729 | 4.81% | 1272 | 5.95%
Dutch | 705 | 1.24% | 501 | 2.34%
Italian | 187 | 0.33% | 150 | 0.70%
French | 168 | 0.30% | 81 | 0.38%
Danish | 163 | 0.29% | 140 | 0.65%
Polish | 119 | 0.21% | 96 | 0.45%
Given that tweets in English constitute the largest relevant category, and since 
German tweets refer to a different understanding of ‘Afrikaners’, we focused our analysis on 
English tweets. 
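
A per-language tally of the kind reported in Table 10.2 could be produced along the following lines. This is a minimal sketch in Python 3 and not the code used in the study: the record keys `lang` and `user`, the function name `language_distribution`, and the toy records are assumptions made purely for illustration.

```python
from collections import Counter


def language_distribution(tweets):
    """Return per-language tweet and user counts with percentage shares.

    `tweets` is an iterable of dicts with the hypothetical keys
    'lang' (detected language) and 'user' (account name).
    """
    tweets = list(tweets)
    tweet_counts = Counter(t["lang"] for t in tweets)

    # Count each (language, user) pair once, so that a prolific user is
    # counted only once per language. A user who tweets in two languages
    # is counted once under each of them.
    user_pairs = {(t["lang"], t["user"]) for t in tweets}
    user_counts = Counter(lang for lang, _ in user_pairs)

    total_tweets = sum(tweet_counts.values())
    total_users = sum(user_counts.values())

    rows = []
    for lang, n_tweets in tweet_counts.most_common():
        n_users = user_counts[lang]
        rows.append({
            "language": lang,
            "tweets": n_tweets,
            "tweet_share": round(100 * n_tweets / total_tweets, 2),
            "users": n_users,
            "user_share": round(100 * n_users / total_users, 2),
        })
    return rows


# Toy records (invented for illustration, not taken from the dataset).
sample = [
    {"lang": "German", "user": "a"},
    {"lang": "German", "user": "a"},
    {"lang": "German", "user": "b"},
    {"lang": "English", "user": "c"},
]
for row in language_distribution(sample):
    print(row)
```

Reporting tweets and users side by side, as Table 10.2 does, makes it visible when a language's share of tweets is driven by a small number of very active accounts.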


10.4 Text pre-processing 


With the introduction of user-generated content on Internet platforms, one of the main 
bottlenecks in any text analytics system is the handling of text data that is often noisy. Noisy 
text typically contains spelling errors, ad hoc abbreviations, contractions, improper casing, 
and incorrect punctuation (Dey & Haque 2009). The text should therefore be “cleansed” 
or pre-processed before analytical operations can be performed on the data. In the context 
of natural language processing, text pre-processing is an essential step in any text analytics 
project, since the words and characters identified during pre-processing will be passed on to 
subsequent processing stages. According to Palmer (2010:9), text [pre-processing] “is the task 
of converting raw text, which can be described as a sequence of digital bits, into well-defined 
sequences of linguistically meaningful units”. Twitter data is known to be particularly noisy 
and contains additional noise such as incomplete or poorly structured sentences, irregular 
expressions, ill-formed words and terms that do not appear in a dictionary (Jianqiang & 
Xiaolin 2017). Before the text can be used for analysis, a series of pre-processing steps must 
be completed to reduce the amount of noise. This series of pre-processing steps includes 
data cleansing, tokenisation, and syntactic parsing (Dey & Haque 2009). During our 
data cleansing phase, URLs were replaced with the tag ||HTTP_URL|| and targets (e.g., 
@John) with the tag ||AT_USER||. Usernames (identified by the @ sign) and any external links 
were then removed. Elongated words, which are words in which one character is repeated multiple 
times (e.g., aaaaangry), were shortened. All punctuation marks (such as full stops, commas, 
question marks, exclamation marks or emoticons) as well as special characters ($, %, &, #, 
etc.) were removed. Re-tweets (i.e., tweets that are re-distributed and begin with ‘RT’) were 
also removed from the corpus. For tokenisation, we split text on white spaces¹⁴², since all 
emoticons, HTML tags, URLs, re-tweets, and user mentions were removed from the text. 
Finally, all stop words (i.e., the most common words in a language such as ‘is’ and ‘the’) 
were removed from the corpus, and the remaining tokens were converted to lowercase. This 
included the words ‘Afrikaner’ and ‘Afrikaners’ since either of these words could be present 
in a tweet. 
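
The cleansing and tokenisation steps described above can be sketched as follows. The sketch is written for Python 3 and is an illustration under stated assumptions, not the study's application: the regular expressions, the function names `is_retweet` and `preprocess`, and the very short stop-word list are hypothetical. It removes links and mentions directly instead of first substituting placeholder tags, and it treats ‘afrikaner’ and ‘afrikaners’ as stop words, which is one possible reading of the description.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); a real project would use a
# fuller list. The search terms are included on one reading of the text.
STOP_WORDS = {"is", "the", "a", "an", "and", "of", "to", "in",
              "afrikaner", "afrikaners"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
# A letter repeated three or more times in a row, e.g. 'aaaaangry'.
ELONGATED_RE = re.compile(r"([A-Za-z])\1{2,}")


def is_retweet(text):
    """Treat a tweet as a re-tweet when it begins with the token 'RT'."""
    return text.strip().startswith("RT ")


def preprocess(text):
    """Return the cleaned, lower-cased tokens of a single tweet."""
    text = URL_RE.sub(" ", text)           # drop external links
    text = USER_RE.sub(" ", text)          # drop user mentions
    text = ELONGATED_RE.sub(r"\1", text)   # 'aaaaangry' becomes 'angry'
    # Remove ASCII punctuation and special characters such as $, %, & and #.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()          # tokenise on white space
    return [tok for tok in tokens if tok not in STOP_WORDS]


raw = "@John The Afrikaner farmers are sooooo angry!!! https://t.co/abc #land"
if not is_retweet(raw):
    print(preprocess(raw))
```

Collapsing a repeated letter to a single character matches the ‘aaaaangry’ example, but it would also turn ‘coool’ into ‘col’; collapsing to two characters and checking against a dictionary is a common alternative. Non-ASCII emoji are not covered by `string.punctuation` and would need separate handling.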


142 White spaces are defined as “the markers that [separate] each word” (http://javadevnotes.com/java-string-split-space-or-whitespace-examples). 
In other words, they are the unused spaces between, for 
example, paragraphs or graphics. Tokenisation is the “process of breaking a sentence by the white 
spaces or any kind of special symbols” (Karpurapu & Jololian 2017:210). 
10.5 Performing analytics on the data 


After the text was pre-processed and available in the required format, data analytics operations 
were performed. Typically, these data analytics operations are used to discover meaningful 
nuggets of knowledge from data or information. For the purpose of this illustrative example, 
we make use of sentiment analysis. 


10.5.1 Introduction to sentiment analysis 


Sentiment analysis is a growing subfield of natural language processing (NLP), and is often 
combined with text analysis and computational linguistics to identify, extract, and study any 
subjective data source (Pang & Lee 2004). In the context of big data, these data sources, 
which are generally referred to as user-generated content (UGC), include blogs, tweets, 
and web reviews. In order to extract information from UGC about, for example, people’s 
opinions and attitudes about topics or products they purchase, sentiment analysis is regularly 
employed (Serrano-Guerrero, Olivas, Romero & Herrera-Viedma 2015:18), and entails 
classifying these opinions/attitudes as positive, negative or neutral. Sentiment analysis can 
also be used to monitor how global news events affect public opinion. For example, a real- 
time sentiment analysis model was used in a study by Wang, Can, Kazemzadeh, Bar and 
Narayanan (2012) to explore public opinion regarding the 2012 US presidential election. 
In a more recent study, the entire life-cycle of a large hydro project was assessed through 
examination of public opinion and critique (Jiang, Lin & Qiang 2015). Other real-world 
applications include We Feel (Milne, Paris, Christensen, Batterham & O’Dea 2015), which 
continually monitors Twitter for emotional content, as well as StockTwits and The Stock 
Sonar (Feldman 2013), which provide traders, investors, and entrepreneurs with stock 
market advice. 

Based on a review of the literature, two main approaches to carrying out a sentiment 
analysis are lexicon-based and machine learning approaches. The lexicon-based approach is 
a rule-based approach that relies on an existing sentiment dictionary or lexicon. Specifically, 
an existing lexicon is employed which should contain the given word and its polarity (e.g., 
‘awesome’ is positive and ‘horrible’ is negative). When a new sentence is classified, words 
in the sentence are matched to words in the lexicon, and using pre-defined rules, the values 
are aggregated into a sentiment score for the sentence. The aggregation of positive or 
negative values produces the semantic orientation of the sentence (Taboada 2016). Such an 
approach has been successfully used to analyse conventional texts such as blogs, forums, and 
product reviews (Turney 2002; Kim & Hovy 2004). Machine learning, on the other hand, 
analyses and interprets patterns or structures in data: it “allows the user to feed a computer 
algorithm an immense amount of data and have the computer analyse and make data-driven 
recommendations and decisions based on only the input data”.¹⁴³ Machine learning 
approaches make use of supervised or unsupervised methods to determine the polarity of 
texts (in the form of a document, sentence, or phrase). Supervised learning models require 
that the algorithm learn on a labelled dataset. These labels can be assigned manually by 
human evaluation or resources that explicitly define ratings (Giachanou & Crestani 2016). 
Examples of machine learning algorithms that classify texts as positive, negative or neutral 
include Naive Bayes (NB), Support Vector Machines (SVM), Maximum Entropy (MaxEnt), 
Random Forests, and Logistic Regression. By contrast, unsupervised learning learns from 
training data that is not labelled or classified; classification is based on fixed syntactic patterns 
(Turney 2002) or on a sentiment lexicon (Taboada, Brooke, Tofiloski, Voll & Stede 2011). 


10.5.2 Applying sentiment analysis 


The sentiment analysis employed in the current study followed the lexicon-based approach 
owing to lack of sufficient training data. Additionally, the fact that the lexicon-based 
approach can be employed across different domains without changing the dictionaries 
makes it an attractive approach for Twitter sentiment analysis (Taboada et al. 2011). 
Furthermore, the lexicon-based approach has been shown to be particularly useful for 
analysing conventional texts such as blogs and tweets (Thelwall, Buckley & Paltoglou 2012; 
Mohammad, Kiritchenko & Zhu 2013). 

The sentiment analyser used two existing lexicon-based dictionaries, namely, Bing 
Liu’s Opinion Lexicon (Hu & Liu 2004), and the National Research Council (NRC)-Canada 
Hashtag Sentiment Lexicon (Mohammad et al. 2013). The Opinion Lexicon 
consists of 6789 words and is divided into two lists of words; the one list contains positive 
(n = 2006) words and the other negative words (n = 4783). The NRC-Canada Hashtag 
Sentiment Lexicon consists of 54129 unigram words associated with positive and negative 
sentiment and was generated automatically from tweets with sentiment-word hashtags such 
as #amazing and #terrible. 

A Python 2.7 application was developed to handle the pre-processing and sentiment 
analysis of each tweet message collected. The sentiment score of each tweet was derived using 
the polarity scores of each word found in the lexicon dictionaries as mentioned earlier. We 
first computed a sentiment score by identifying the sentiment words in each of the sentiment 
lexicons. A semantic orientation score of +1 is assigned to a positive word and a semantic 
orientation score of -1 is assigned to a negative word. The sentiment score is then calculated 
as the sum of scores of its sentiment words divided by the number of scored words to produce 
an average score. The average score was then used in a classification method to classify the 
polarity of the tweets into either positive, negative or neutral categories. The sentiment 
function would finally return a polarity score (positive/negative) between -1.0 and +1.0. 


143 https://www.netapp.com/us/info/what-is-machine-learning-ml.aspx 
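
The scoring rule described in this section (+1 for a positive word, -1 for a negative word, averaged over the scored words) can be sketched as follows. This is an illustration in Python 3, not the Python 2.7 application used in the study; the function names and the two toy word sets are assumptions, and a real run would load the Opinion Lexicon or the NRC Hashtag Sentiment Lexicon instead.

```python
# Toy lexicon (assumption): stands in for a full sentiment lexicon.
POSITIVE = {"awesome", "good", "happy"}
NEGATIVE = {"horrible", "bad", "angry"}


def sentiment_score(tokens, positive_words, negative_words):
    """Average the +1/-1 scores of the tokens found in the lexicon.

    Tokens that appear in neither word list are ignored. The result lies
    between -1.0 and +1.0; 0.0 is returned when no token is scored.
    """
    scores = []
    for tok in tokens:
        if tok in positive_words:
            scores.append(1.0)
        elif tok in negative_words:
            scores.append(-1.0)
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def classify(score):
    """Map a score to a polarity label using a threshold of zero."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = ["farmers", "angry", "bad", "good"]
score = sentiment_score(tokens, POSITIVE, NEGATIVE)
print(score, classify(score))
```

Averaging over the scored words only, and not over all tokens, keeps the score independent of tweet length, which is why a tweet with a single sentiment word can still reach -1.0 or +1.0.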

Special care was also taken with negation and modifiers. Negation refers to the task 
of converting a positive to a negative (or the reverse) through special words such as ‘never’, 
‘no’, and ‘not’. For example, if ‘not good’ was detected in the dataset, the polarity score of the 
following word (i.e., ‘good’) was reversed. A similar approach was followed with modifiers 
such as ‘much’, ‘very’, and ‘really’ (as in ‘very happy’, for example). The polarity of the 
following word (i.e., ‘happy’) was adjusted by a factor of 1.3. This would either increase or 
decrease the polarity score of the word. 
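
The treatment of negation and modifiers can be illustrated with a short sketch. It is written in Python 3 under stated assumptions and is not the study's own code: the function name `word_scores`, the toy lexicon, and the decision to look only at the immediately preceding token are hypothetical choices.

```python
NEGATIONS = {"never", "no", "not"}
MODIFIERS = {"much", "very", "really"}
MODIFIER_FACTOR = 1.3  # factor reported in the text


def word_scores(tokens, lexicon):
    """Return the adjusted polarity scores of the lexicon words in `tokens`.

    `lexicon` maps a word to +1.0 (positive) or -1.0 (negative). Only the
    token directly before a sentiment word is inspected (assumption).
    """
    scores = []
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            continue
        score = lexicon[tok]
        previous = tokens[i - 1] if i > 0 else None
        if previous in NEGATIONS:
            score = -score               # 'not good' becomes negative
        elif previous in MODIFIERS:
            score *= MODIFIER_FACTOR     # 'very happy' is strengthened
        scores.append(score)
    return scores


toy_lexicon = {"good": 1.0, "happy": 1.0, "bad": -1.0}
print(word_scores(["not", "good"], toy_lexicon))
print(word_scores(["very", "bad"], toy_lexicon))
```

Two practical points follow from this sketch. First, words such as ‘not’ and ‘very’ often appear in stop-word lists, so they would have to be kept during pre-processing for these rules to fire. Second, a factor of 1.3 pushes an individual word score beyond the range of -1.0 to +1.0, so the final score would need to be clipped or rescaled to stay in that range; the text does not state how this was handled.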


10.5.3 Sentiment analysis results 


Three sentiment classifiers were used and compared with one another. The classifiers 
included our own polarity classifier, AFINN¹⁴⁴ (Nielsen 2011), and Pattern for Python 
(De Smedt & Daelemans 2012). A threshold of zero was used to classify the tweets into 
positive, negative or neutral groupings. In other words, if the score was 0, which indicates no 
sentiment value, the tweet was classified as neutral. Any score above zero (e.g., +0.1) was 
considered positive, and any score below zero (e.g., -0.1) was considered negative. The results of the sentiment analyser in terms 
of polarities are shown in Table 10.3. 


Table 10.3: Results of the sentiment analysis (without re-tweets, n = 4505) 


Method | Lexicon | Negative | Neutral | Positive | Negative proportion (%)
Own classifier | Opinion Lexicon | 1508 | 1796 | 1201 | 33.5%
Own classifier | NRC Hashtag Sentiment Lexicon | 2595 | 1130 | 780 | 57.6%
AFINN | - | 1705 | 1334 | 1466 | 37.8%
Pattern | - | 758 | 2488 | 1259 | 16.8%
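
The negative proportions in Table 10.3 follow directly from the counts in each row. The short Python 3 check below recomputes them; the counts are copied from the table, while the dictionary labels and the function name `negative_proportion` are illustrative.

```python
# Counts copied from Table 10.3: (negative, neutral, positive).
RESULTS = {
    "Own classifier + Opinion Lexicon": (1508, 1796, 1201),
    "Own classifier + NRC Hashtag Sentiment Lexicon": (2595, 1130, 780),
    "AFINN": (1705, 1334, 1466),
    "Pattern": (758, 2488, 1259),
}


def negative_proportion(negative, neutral, positive):
    """Share of negative tweets as a percentage, rounded to one decimal."""
    total = negative + neutral + positive
    return round(100 * negative / total, 1)


for name, counts in RESULTS.items():
    # Every row should cover the same 4505 tweets (re-tweets excluded).
    assert sum(counts) == 4505
    print(f"{name}: {negative_proportion(*counts)}%")
```

Each row sums to 4505, so the four methods were applied to the same set of tweets and the differences in the table reflect the classifiers and lexicons, not the data.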


Based on Table 10.3, it is evident that the results vary, but this is not unexpected 
since a lexicon is dependent on a particular domain. The Opinion Lexicon was extracted 
from customer reviews (that were not limited to 140 characters), and the Hashtag Sentiment 
Lexicon was generated from sentiment-word hashtags contained in a tweet. AFINN 
contains only 2477 words, while the Hashtag Sentiment Lexicon contains 54129 words, 
which means that AFINN has far fewer words with polarity scores. Figure 10.1 provides a breakdown 
of the results obtained using the four classifier and lexicon combinations. 


144 AFINN is an affective lexicon developed by Finn Årup Nielsen, who is a senior researcher at the 
Technical University of Denmark. 
[Figure: four charts titled ‘Opinion lexicon sentiment score’, ‘NRC Hashtag sentiment score’, ‘AFINN sentiment score’ and ‘Pattern sentiment score’, each showing the proportions of negative, neutral and positive tweets.]


Figure 10.1: A comparison of sentiment classifiers 


For the purpose of our study, the Hashtag Sentiment Lexicon was used: it contains the 
largest number of entries (32048 positive and 22081 negative words) and was generated automatically 
from Twitter data which contained sentiment-related hashtags such as #terrible and 
#amazing. 


10.6 Visualising the results 


Using the NRC Hashtag Sentiment Lexicon sentiment classifier, we could investigate 
where users were registered and whether they tweeted about the Afrikaner in a positive 
or negative light. Figure 10.2 below illustrates the countries where users were located 
according to sentiment. We show only those countries with more than ten mentions of 
the word ‘Afrikaner’. 


Figure 10.2: Tweets about the Afrikaner by country 


Figure 10.2 indicates that out of the 48 countries that referred to the Afrikaner, only 
ten mentioned the Afrikaner more than ten times. The donut chart in the top left-hand 
corner shows that 78% of users were based in South Africa, while the country with the 
second-most users who mentioned the Afrikaner was the United States with a contribution 
of 7.7% of all tweets collected. While the figure clearly illustrates that discourse about the 
Afrikaner was concentrated in South Africa, it also shows that users in countries to which 
a large number of South Africans have emigrated such as the United States, Australia, the 
United Kingdom, Canada and Ireland also contributed to the discourse. Keep in mind the 
2018 incidents relating to the political concerns that the US and Australia had with South 
Africa over Afrikaner issues mentioned in section 10.2. Countries with strong historical ties 
to the Afrikaner such as the Netherlands and Germany also contributed significantly to the 
discourse on the Afrikaner. Lastly, Namibia and Zimbabwe contributed to the discourse on 
the Afrikaner, which is understandable given these countries’ proximity to South Africa. 
Notably, the rest of Africa, South America, the Middle East, the Far East and Russia did 
not contribute to this discourse in a significant way. Overall, the figure shows that for 
Twitter users in neighbouring countries, countries with historical ties to South Africa, and 
countries where a large number of Afrikaners have settled since 1994, the Afrikaner was 
relevant. Nevertheless, almost 80% of the discourse was confined to South Africa, which 
also illustrates a limited global relevance. 
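
A country breakdown of this kind, restricted to countries with more than ten mentions, could be computed roughly as follows. The sketch is in Python 3 and is illustrative only: the function name `country_shares`, the input format (one country name per tweet) and the toy counts are assumptions, not the study's data or code.

```python
from collections import Counter


def country_shares(user_countries, min_mentions=10):
    """Return {country: percentage share} for countries above a threshold.

    `user_countries` holds one country name per tweet (hypothetical input);
    None or an empty string marks an unknown location and is skipped.
    Shares are computed over all tweets with a known location.
    """
    counts = Counter(c for c in user_countries if c)
    total = sum(counts.values())
    return {
        country: round(100 * n / total, 1)
        for country, n in counts.most_common()
        if n > min_mentions
    }


# Toy input (invented numbers, chosen only to exercise the threshold).
sample = (["South Africa"] * 78) + (["United States"] * 12) + (["Namibia"] * 10)
print(country_shares(sample))
```

Because the threshold is strict (more than ten mentions), a country with exactly ten mentions is left out of the result, although its tweets still count towards the total on which the shares are based.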

Users in all these countries generally tweeted about the Afrikaner in a negative way: 
on average and as the donut chart at the bottom shows, approximately 58% of tweets about 
the Afrikaner were negative, while only 16.9% were positive. We checked the distribution 
of positive, negative, and neutral tweets per country, and all approximated this distribution. 
The most negative users were located in Canada, followed by users in Australia and the 
United Kingdom. Users in Germany and Zimbabwe were the least negative as indicated 
in green on the world map. The context in which these tweets were generated, however, is 
significant: we investigated what was said in a given context, and in most cases, the negative 
tweets referred to a negative context. In other words, negative tweets referred to the Afrikaner 
in a negative context (and not in a negative light) such as in the context of farm attacks, 
land expropriation or white poverty. In many cases, users were sympathetic to the Afrikaner, 
but it is the context that made the tweets negative. Figure 10.3 shows a few examples from 
outside South Africa. 


A truly worthy cause to help Afrikaner children 
Afrikaner Children Squatter Camps 

White farmers tortured with drills and blowtorches, Afrikaner rights group claims 

South Africa: Surge in BRUTAL ATTACKS targeting white farmers - Afrikaner campaign group 

Victims were assaulted with electric drills, blowtorches and bleach, indicating “a racial 
element”, said Ernst Roets, of Afriforum, which champions the rights of the Afrikaner 
minority 


Figure 10.3: Examples of negative tweets 
What we can see from the examples in Figure 10.3 is that the issues that brought the 
Afrikaner to international attention in 2018 contributed to the context in which the Afrikaner 
was regarded that year. The picture painted of Afrikaners on Twitter from outside South Africa 
is one in which the Afrikaner is seen as being at risk owing to poverty, economic exclusion, 
and farm attacks. As an aside, the issue of white poverty is so widely reported on outside South 
Africa that a Google Image search of the term ‘South African squatter camps’ delivers results 
such as the one in Figure 10.4. (The screenshot was taken on 20 February 2019 and any facial 
features have been blurred.) 



Figure 10.4: Google Image search of ‘South African squatter camps’ 


What is fascinating, and disturbing, about this Google Image search is that 
when a Google spokesperson was asked why a search of ‘squatter camps in South Africa’ 
overwhelmingly yields the kinds of images seen in Figure 10.4, he replied, “Because our 
systems are surfacing and organising information and content from the web, [a] search 
can mirror biases or stereotypes that exist on the web and in the real world. [...] We [...] 
will continue to work, to improve image results for all of our users” (Tembo 2018:1).¹⁴⁵ 
According to cyber security expert Catalin Cimpanu (2019:1), Google has structured its 
URL parameters in a way that “allows threat actors¹⁴⁶ a way to essentially edit search 
results, which is a dangerous issue”. Journalist Lynsey Chutel (2018:2) points out that many 
South Africans were irate when they encountered the search results for ‘squatter camps in 
South Africa’, and took to social media to voice their opinions. Significantly, she observes that 
people’s anger at Google’s search results is misdirected, “since algorithms learn what humans 
teach them through their [behaviour]”. She adds that “[the] search results are a reflection on 
a broader conversation on race and poverty” (Chutel 2018:2). 


145 We do not dismiss the existence of white informal dwellings which are well documented in the 
literature (Sibanda 2012; Kruger 2016). According to the South African Human Rights Commission’s 
Equality Report 2017/2018, 1% of whites in South Africa live in poverty. Approximately 64% of 
blacks, 41% of coloureds, and 6% of Indians are poverty-stricken. (See 
https://www.sahrc.org.za/home/21/files/SAHRC%20Equality%20Report%202017_18.pdf). 


146 In cyber security, a threat actor is an individual who perpetrates malicious online acts. 

Figure 10.5 below shows examples of negative tweets from within South Africa. 
We found that in South Africa, Afrikaners were depicted as oppressors and as racist and 
corrupt, while the outside view was one of Afrikaners as victims of negative circumstances 
(hence the negativity score of tweets). 


Check out AFRIKANER corruption. I can't 
wait for AFRIKANER HYPOCRITES to comment 

This is the reason I hate the Afrikaner 
Broederbond IC sellout Askaris. I will 
never vote for these sellouts, voting for them 
is voting for MC Skellembosch Boys, We 
are all in this mess because of them! 

This Afrikaner corporate corruption is getting 
out of hand now! Roet 

That says everything about who you are as a 
human being. A racist hate filled sad, and very 
entitled Afrikaner who has not realised that 
Apartheid is over and this is not France. 

Afriforum's PR campaign to show they're not 
really Apartheid/Colonialism apologists who 
only serve Afrikaner interests 
****checks notes**** 
Afriforum assists Coloured Apartheid military 
collaborators in getting military veteran 
status 


Figure 10.5: Negative tweets from within South Africa 


Expressed a little differently, while 58,4% of tweets from across the globe were negative 
when they referred to the Afrikaner, a significant nuance was detected: tweets from within 
South Africa portayed the Afrikaner in a negative light, while tweets from outside South Africa 
depicted the Afrikaner as inhabiting a negative space. These findings are noteworthy as they 
point to what Theunissen (2015:2) refers to as “incongruities and ambiguities” surrounding 
Afrikaner identity, which is not surprising since this concept is not stable or homogenous 
(Theunissen 2015:3; cf. Blaser 2012:11). We would like to emphasise that our study merely 
illustrates how big data may be employed to show how Afrikaners are depicted on Twitter in 
South Africa and abroad. Quoting a study by Pretorius (2014:21), we recognise that Twitter 
users both in and outside South Africa may frame or justify the experiences of Afrikaners 
such as those related to farm attacks as “a Boer genocide” or as “a form of colonial struggle/ 
restitution” that “remains rooted in totalising Afrikaner and black nationalisms respectively”. 
Referring to a cult of white victimhood, social media expert and journalist Hannelie Booysens 
(2018:68) argues that social media platforms constitute “the perfect vehicle” for instilling 
fears about issues such as farm attacks. Booysen (2018:68) observes that in terms of statistics, 
“farmers are no more likely to be victims of violent crime than any other demographic group” 
in South Africa.147 A big data study of how Twitter users refer to Afrikaners also needs to 
include reasons why Afrikaners are framed as they are. We argue that humanists who choose 
to examine how Afrikaners are portrayed on social media platforms need to take into account 
that “complex cognitive processes and intergroup behaviour [are] at play” (Theunissen 2015:3) 
in Afrikaner identity. This brings us back to the important message we underscored in Chapter 
9, namely that we need to understand the social, political, and economic dimensions of any dataset 
under investigation because the dataset itself cannot help us make inferences about the world 
(Blei & Smyth 2017:8690). 


10.7 Conclusion 


While the study is merely an illustration and the dataset too small to generate generalisable 
conclusions, it nevertheless indicates that big data allows researchers in the humanities to 
ask new questions, and in novel ways. It shows how a humanities question such as “How is 
the Afrikaner viewed on Twitter?” can be answered using big data. The study also signals that 
researchers need to be cautious about the authenticity of online data, since individuals and 
organisations may deliberately disseminate misinformation and disinformation, where the 
former refers to false information designed to deceive online viewers and the latter to false 
information that is generally geared towards propaganda that promotes a specific agenda or 
viewpoint (Kumar & Geethakumari 2014:3). 


147 This does not mean that we do not acknowledge that farm attacks occur. Indeed, according to the 
South African Police Service’s crime statistics, the most dangerous province for farm murders is 
Gauteng (Head 2018). According to Kate Wilkinson (2019:1) of Africa Check, two major problems 
that made it difficult to accurately calculate the farm murder rate in South Africa in 2015/2016 
were that (1) the South African Police Service's definition of what a farm murder entails was vague 
and (2) a breakdown of the status of victims of farm attacks was not analysed by the South African 
Police Service. 
While there is no need to educate social scientists and humanists ... in the technical 
detail and inner workings of ... [big] data tools, it is absolutely critical to introduce 
to them its capabilities and ability to trigger their imagination, assisting them in 
asking the “right questions.” — Dominic Lam (2014:4) 


It would be easy to discourse at length about the flaws reflected in a field that is not yet fully 
understood because it finds itself at the early stages of practice and experimentation. Yet, the 
many successful collaborations between humanists/social scientists and (big) data scientists 
discussed throughout this book point to the importance of not dismissing big data science 
unconditionally. Kaplan (2015:1) correctly observes that “[most] of the methods needed to 
study ... [large] datasets still need to be invented, as they are currently not mastered [either] 
by humanists or computer scientists”. We could add that social scientists too are feeling their 
way through big data. We closely examined many of the ontological and epistemological 
challenges of big data which van Dijck (2014:206) urges humanists and social scientists 
to address, particularly “[if] predictive analytics and real-time data analytics become the 
preferred modes of scientific analysis of human behavior”. Big data needs social scientists 
and humanists. ‘Reinventing’ the social scientist and humanist in the era of big data is not 
about these scholars being compelled to radically alter what they already do. Given that both 
the humanities and the social sciences are currently being hybridised with technological and 
data-driven research, reinvention is more about humanists and social scientists retooling 
themselves and working in collaboration with data scientists in order to respond to the 
era of big data. We call on these scholars not only to carry out “big data science for good” 
projects, but also to critically assess the societal, ethical, and political impacts of big data. We 
call on them to work closely with data scientists in order to advance a big data environment 
in which there is greater sensitivity to these impacts. Finally, and without wishing to sound 
contradictory, we urge humanists and social scientists to consider pursuing a small data 
research agenda within a big data world: “small data studies will continue to flourish because 
they have a proven track record of answering specific questions” (Kitchin & Lauriault 
2015:464).148 What is changing and thus opening up small data to big data analytics is the 
pooling and linking of this data into data infrastructures in order to make data not only 
accessible and stimulating, but also transparent. 


148 In fact, Kitchin and Lauriault (2015:473) state that “[the] pressure to harmonize, share and reuse 
small data will continue to grow as research funders seek to gain the maximum return on their 
investment through new knowledge and innovations”. 
This book explores the big data evolution by interrogating the notion that big data 
is a disruptive innovation that appears to be challenging existing epistemologies in 
the humanities and social sciences. Exploring various (controversial) facets of big data 
such as ethics, data power, and data justice, the book attempts to clarify the trajectory 
of the epistemology of (big) data-driven science in the humanities and social sciences. 


Susan Brokensha is a Senior Lecturer in the Department of English at the University of 
the Free State and has a PhD in Applied Language Studies. As a teacher, she is passionate 
about how the pedagogical value of learning management systems may be exploited 
to enhance teaching and learning. As a researcher, she is particularly interested in both 
linguistic and non-linguistic aspects of behaviour and communication in offline and online 
contexts. Against the backdrop of the fourth industrial revolution, she has recently begun 
interrogating how big data tools and technologies reflect perils and possibilities for scholars 
carrying out qualitative research in the humanities and social sciences. 


Eduan Kotzé holds a PhD in Computer Information Systems and is a Senior Lecturer and 
the Academic Departmental Head of the Department of Computer Science and Informatics 
at the University of the Free State. Dr Kotzé has over 15 years of technical and managerial 
experience in Information Technology, specialising in database management systems, 
data warehousing, business intelligence, and text mining. His research focuses mainly on 
algorithms and methods to process big datasets while employing a data science approach. 
These datasets are predominately in unstructured natural language text formats. 


Burgert Senekal has been associated with the University of the Free State since 2008. After 
completing his Master’s degree in Afrikaans, Dr Senekal completed a Master's degree in 
English on contemporary British fiction. In 2013, he obtained his PhD on counterinsurgency 
literature at the University of the Free State. He has published more than 50 peer-reviewed 
articles and is an NRF-rated researcher. His research interests include alienation, information 
technology, big data, data science, and complex networks. 


SUNBONANI SCHOLAR 


